Artificial Intelligence 106
☆ On the Surprising Effectiveness of Attention Transfer for Vision Transformers NeurIPS 2024
Conventional wisdom suggests that pre-training Vision Transformers (ViT)
improves downstream performance by learning useful representations. Is this
actually true? We investigate this question and find that the features and
representations learned during pre-training are not essential. Surprisingly,
using only the attention patterns from pre-training (i.e., guiding how
information flows between tokens) is sufficient for models to learn high
quality features from scratch and achieve comparable downstream performance. We
show this by introducing a simple method called attention transfer, where only
the attention patterns from a pre-trained teacher ViT are transferred to a
student, either by copying or distilling the attention maps. Since attention
transfer lets the student learn its own features, ensembling it with a
fine-tuned teacher also further improves accuracy on ImageNet. We
systematically study various aspects of our findings on the sufficiency of
attention maps, including distribution shift settings where they underperform
fine-tuning. We hope our exploration provides a better understanding of what
pre-training accomplishes and leads to a useful alternative to the standard
practice of fine-tuning.
comment: NeurIPS 2024. Code:
https://github.com/alexlioralexli/attention-transfer
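The distillation variant described above can be pictured as a loss that pulls the student's attention maps toward the teacher's. The sketch below is a minimal illustration, not the authors' implementation: the abstract does not state the exact objective, so the KL-divergence form, the function names, and the toy 3-token scores are all assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of query-key scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(scores):
    """Row-wise softmax: each query row becomes a distribution over keys."""
    return [softmax(row) for row in scores]

def attention_distill_loss(teacher_attn, student_attn, eps=1e-12):
    """Mean over query rows of KL(teacher || student).

    Assumed objective for illustration; the paper may use a different loss.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_attn, student_attn):
        total += sum(
            t * (math.log(t + eps) - math.log(s + eps))
            for t, s in zip(t_row, s_row)
        )
    return total / len(teacher_attn)

# Toy example: a peaked teacher map versus a uniform (untrained) student map.
teacher = attention_map([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 3.0]])
student = attention_map([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
loss_far = attention_distill_loss(teacher, student)
loss_same = attention_distill_loss(teacher, teacher)
```

In this sketch only the attention distributions are compared; the student's features are never matched against the teacher's, which mirrors the claim that the student learns its own features.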
☆ LLM Hallucination Reasoning with Zero-shot Knowledge Test
LLM hallucination, where LLMs occasionally generate unfaithful text, poses
significant challenges for their practical applications. Most existing
detection methods rely on external knowledge, LLM fine-tuning, or
hallucination-labeled datasets, and they do not distinguish between different
types of hallucinations, a distinction crucial for improving detection performance.
We introduce a new task, Hallucination Reasoning, which classifies
LLM-generated text into one of three categories: aligned, misaligned, and
fabricated. Our novel zero-shot method assesses whether the LLM has enough
knowledge about a given prompt and text. Our experiments conducted on new
datasets demonstrate the effectiveness of our method in hallucination reasoning
and underscore its importance for enhancing detection performance.
comment: 12 pages, 2 figures
☆ Towards a Classification of Open-Source ML Models and Datasets for Software Engineering
Background: Open-Source Pre-Trained Models (PTMs) and datasets provide
extensive resources for various Machine Learning (ML) tasks, yet these
resources lack a classification tailored to Software Engineering (SE) needs.
Aims: We apply an SE-oriented classification to PTMs and datasets on a popular
open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs
over time. Method: We conducted a repository mining study. We started with a
systematically gathered database of PTMs and datasets from the HF API. Our
selection was refined by analyzing model and dataset cards and metadata, such
as tags, and confirming SE relevance using Gemini 1.5 Pro. All analyses are
replicable, with a publicly accessible replication package. Results: The most
common SE task among PTMs and datasets is code generation, with a primary focus
on software development and limited attention to software management. Popular
PTMs and datasets mainly target software development. Among ML tasks, text
generation is the most common in SE PTMs and datasets. There has been a marked
increase in PTMs for SE since 2023 Q2. Conclusions: This study underscores the
need for broader task coverage to enhance the integration of ML within SE
practices.
comment: 5 pages, 8 figures
☆ NeuralDEM - Real-time Simulation of Industrial Particulate Flows
Benedikt Alkin, Tobias Kronlachner, Samuele Papa, Stefan Pirker, Thomas Lichtenegger, Johannes Brandstetter
Advancements in computing power have made it possible to numerically simulate
large-scale fluid-mechanical and/or particulate systems, many of which are
integral to core industrial processes. Among the different numerical methods
available, the discrete element method (DEM) provides one of the most accurate
representations of a wide range of physical systems involving granular and
discontinuous materials. Consequently, DEM has become a widely accepted
approach for tackling engineering problems connected to granular flows and
powder mechanics. Additionally, DEM can be integrated with grid-based
computational fluid dynamics (CFD) methods, enabling the simulation of chemical
processes taking place, e.g., in fluidized beds. However, DEM is
computationally intensive because of the intrinsic multiscale nature of
particulate systems, restricting simulation duration or number of particles.
To this end, NeuralDEM presents an end-to-end approach to replace slow
numerical DEM routines with fast, adaptable deep learning surrogates. NeuralDEM
is capable of picturing long-term transport processes across different regimes
using macroscopic observables without any reference to microscopic model
parameters. First, NeuralDEM treats the Lagrangian discretization of DEM as an
underlying continuous field, while simultaneously modeling macroscopic behavior
directly as additional auxiliary fields. Second, NeuralDEM introduces
multi-branch neural operators scalable to real-time modeling of
industrially-sized scenarios - from slow and pseudo-steady to fast and
transient. Such scenarios have previously posed insurmountable challenges for
deep learning models. Notably, NeuralDEM faithfully models coupled CFD-DEM
fluidized bed reactors of 160k CFD cells and 500k DEM particles for
trajectories of 28s. NeuralDEM will open many new doors to advanced engineering
and much faster process cycles.
comment: Project page: https://nx-ai.github.io/NeuralDEM/
☆ Med-Bot: An AI-Powered Assistant to Provide Accurate and Reliable Medical Information
This paper introduces Med-Bot, an AI-powered chatbot designed to provide
users with accurate and reliable medical information. Utilizing advanced
libraries and frameworks such as PyTorch, Chromadb, Langchain and Autogptq,
Med-Bot is built to handle the complexities of natural language understanding
in a healthcare context. The integration of Llama-assisted data processing and
AutoGPT-Q provides enhanced performance in processing and responding to queries
based on PDFs of medical literature, ensuring that users receive precise and
trustworthy information. This research details the methodologies employed in
developing Med-Bot and evaluates its effectiveness in disseminating healthcare
information.
comment: 3 figures, 5 pages. Keywords: LLM, AI-powered healthcare, Medical
chatbot, Context-based interaction, Llama-assisted data processing,
AutoGPT-Q, PyTorch, TensorFlow, Reliable medical information, Machine
learning in healthcare, Conversational AI
☆ On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse
Specifying all desirable properties of a language model is challenging, but
certain requirements seem essential. Given samples from an unknown language,
the trained model should produce valid strings not seen in training and be
expressive enough to capture the language's full richness. Otherwise,
outputting invalid strings constitutes "hallucination," and failing to capture
the full range leads to "mode collapse." We ask if a language model can meet
both requirements.
We investigate this within a statistical language generation setting building
on Gold and Angluin. Here, the model receives random samples from a
distribution over an unknown language K, which belongs to a possibly infinite
collection of languages. The goal is to generate unseen strings from K. We say
the model generates from K with consistency and breadth if, as training size
increases, its output converges to all unseen strings in K.
Kleinberg and Mullainathan [KM24] asked if consistency and breadth in
language generation are possible. We answer this negatively: for a large class
of language models, including next-token prediction models, this is impossible
for most collections of candidate languages. This contrasts with [KM24]'s
result, showing consistent generation without breadth is possible for any
countable collection of languages. Our finding highlights that generation with
breadth fundamentally differs from generation without breadth.
As a byproduct, we establish near-tight bounds on the number of samples
needed for generation with or without breadth.
Finally, our results offer hope: consistent generation with breadth is
achievable for any countable collection of languages when negative examples
(strings outside K) are available alongside positive ones. This suggests that
post-training feedback, which encodes negative examples, can be crucial in
reducing hallucinations while limiting mode collapse.
comment: Abstract shortened to fit arXiv limit
☆ One-Shot Manipulation Strategy Learning by Making Contact Analogies
We present a novel approach, MAGIC (manipulation analogies for generalizable
intelligent contacts), for one-shot learning of manipulation strategies with
fast and extensive generalization to novel objects. By leveraging a reference
action trajectory, MAGIC effectively identifies similar contact points and
sequences of actions on novel objects to replicate a demonstrated strategy,
such as using different hooks to retrieve distant objects of different shapes
and sizes. Our method is based on a two-stage contact-point matching process
that combines global shape matching using pretrained neural features with local
curvature analysis to ensure precise and physically plausible contact points.
We experiment with three tasks including scooping, hanging, and hooking
objects. MAGIC demonstrates superior performance over existing methods,
achieving significant improvements in runtime speed and generalization to
different object categories. Website: https://magic-2024.github.io/ .
comment: CoRL LEAP Workshop, 2024
☆ Vision-based Manipulation of Transparent Plastic Bags in Industrial Setups
F. Adetunji, A. Karukayil, P. Samant, S. Shabana, F. Varghese, U. Upadhyay, R. A. Yadav, A. Partridge, E. Pendleton, R. Plant, Y. Petillot, M. Koskinopoulou
This paper addresses the challenges of vision-based manipulation for
autonomous cutting and unpacking of transparent plastic bags in industrial
setups, aligning with the Industry 4.0 paradigm. Industry 4.0, driven by data,
connectivity, analytics, and robotics, promises enhanced accessibility and
sustainability throughout the value chain. The integration of autonomous
systems, including collaborative robots (cobots), into industrial processes is
pivotal for efficiency and safety. The proposed solution employs advanced
Machine Learning algorithms, particularly Convolutional Neural Networks (CNNs),
to identify transparent plastic bags under varying lighting and background
conditions. Tracking algorithms and depth sensing technologies are utilized for
3D spatial awareness during pick and placement. The system addresses challenges
in grasping and manipulation, considering optimal points, compliance control
with vacuum gripping technology, and real-time automation for safe interaction
in dynamic environments. The system's successful testing and validation in the
lab with the FRANKA robot arm showcases its potential for widespread
industrial applications, while demonstrating effectiveness in automating the
unpacking and cutting of transparent plastic bags for an 8-stack bulk-loader
based on specific requirements and rigorous testing.
☆ PTR: Precision-Driven Tool Recommendation for Large Language Models
By augmenting Large Language Models (LLMs) with external tools, their
capacity to solve complex problems has been significantly enhanced. However,
despite ongoing advancements in the parsing capabilities of LLMs, incorporating
all available tools simultaneously in the prompt remains impractical due to the
vast number of external tools. Consequently, it is essential to provide LLMs
with a precise set of tools tailored to the specific task, considering both
quantity and quality. Current tool retrieval methods primarily focus on
refining the ranking list of tools and directly packaging a fixed number of
top-ranked tools as the tool set. However, these approaches often fail to equip
LLMs with the optimal set of tools prior to execution, since the optimal number
of tools for different tasks could be different, resulting in inefficiencies
such as redundant or unsuitable tools, which impede immediate access to the
most relevant tools. This paper addresses the challenge of recommending precise
toolsets for LLMs. We introduce the problem of tool recommendation, define its
scope, and propose a novel Precision-driven Tool Recommendation (PTR) approach.
PTR captures an initial, concise set of tools by leveraging historical tool
bundle usage and dynamically adjusts the tool set by performing tool matching,
culminating in a multi-view-based tool addition. Additionally, we present a new
dataset, RecTools, and a metric, TRACC, designed to evaluate the effectiveness
of tool recommendation for LLMs. We further validate our design choices through
comprehensive experiments, demonstrating promising accuracy across two open
benchmarks and our RecTools dataset.
☆ Local-Global Attention: An Adaptive Mechanism for Multi-Scale Feature Integration
In recent years, attention mechanisms have significantly enhanced the
performance of object detection by focusing on key feature information.
However, prevalent methods still encounter difficulties in effectively
balancing local and global features. This imbalance hampers their ability to
capture both fine-grained details and broader contextual information, two
critical elements for achieving accurate object detection. To address these
challenges, we propose a novel attention mechanism, termed Local-Global
Attention, which is designed to better integrate both local and global
contextual features. Specifically, our approach combines multi-scale
convolutions with positional encoding, enabling the model to focus on local
details while concurrently considering the broader global context.
Additionally, we introduce learnable parameters, which allow the model to
dynamically adjust the relative importance of local and global attention,
depending on the specific requirements of the task, thereby optimizing feature
representations across multiple scales. We have thoroughly evaluated the
Local-Global Attention mechanism on several widely used object detection and
classification datasets. Our experimental results demonstrate that this
approach significantly enhances the detection of objects at various scales,
with particularly strong performance on multi-class and small object detection
tasks. In comparison to existing attention mechanisms, Local-Global Attention
consistently outperforms them across several key metrics, all while maintaining
computational efficiency.
☆ Accelerating Knowledge Graph and Ontology Engineering with Large Language Models
Large Language Models bear the promise of significant acceleration of key
Knowledge Graph and Ontology Engineering tasks, including ontology modeling,
extension, modification, population, alignment, as well as entity
disambiguation. We lay out LLM-based Knowledge Graph and Ontology Engineering
as a new and coming area of research, and argue that modular approaches to
ontologies will be of central importance.
☆ LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
This work explores expanding the capabilities of large language models (LLMs)
pretrained on text to generate 3D meshes within a unified model. This offers
key advantages of (1) leveraging spatial knowledge already embedded in LLMs,
derived from textual sources like 3D tutorials, and (2) enabling conversational
3D generation and mesh understanding. A primary challenge is effectively
tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly.
To address this, we introduce LLaMA-Mesh, a novel approach that represents the
vertex coordinates and face definitions of 3D meshes as plain text, allowing
direct integration with LLMs without expanding the vocabulary. We construct a
supervised fine-tuning (SFT) dataset enabling pretrained LLMs to (1) generate
3D meshes from text prompts, (2) produce interleaved text and 3D mesh outputs
as required, and (3) understand and interpret 3D meshes. Our work is the first
to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge
for 3D mesh generation in a text-based format, effectively unifying the 3D and
text modalities. LLaMA-Mesh achieves mesh generation quality on par with models
trained from scratch while maintaining strong text generation performance.
comment: See the project website at
https://research.nvidia.com/labs/toronto-ai/LLaMA-Mesh/
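To make the plain-text mesh representation concrete, the sketch below serializes vertices and faces into text lines and parses them back. It assumes an OBJ-style layout (`v x y z` and `f a b c` with 1-based face indices) and integer-quantized coordinates; the abstract only says that coordinates and faces are written as plain text, so the exact format and the helper names are illustrative assumptions.

```python
def mesh_to_text(vertices, faces):
    """Serialize a triangle mesh as OBJ-style text lines (assumed format)."""
    lines = [f"v {x} {y} {z}" for x, y, z in vertices]
    lines += [f"f {a} {b} {c}" for a, b, c in faces]
    return "\n".join(lines)

def text_to_mesh(text):
    """Parse OBJ-style text back into vertex and face tuples."""
    vertices, faces = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            vertices.append(tuple(int(p) for p in parts[1:4]))
        elif parts[0] == "f":
            faces.append(tuple(int(p) for p in parts[1:4]))
    return vertices, faces

# A tetrahedron with integer coordinates; faces index vertices from 1.
tetra_vertices = [(0, 0, 0), (10, 0, 0), (0, 10, 0), (0, 0, 10)]
tetra_faces = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]

mesh_text = mesh_to_text(tetra_vertices, tetra_faces)
parsed_vertices, parsed_faces = text_to_mesh(mesh_text)
```

Because the output is ordinary text, a string like `mesh_text` can be tokenized by an existing LLM tokenizer without adding new vocabulary entries, which is the property the abstract highlights.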
☆ SMILE-UHURA Challenge -- Small Vessel Segmentation at Mesoscopic Scale from Ultra-High Resolution 7T Magnetic Resonance Angiograms
Soumick Chatterjee, Hendrik Mattern, Marc Dörner, Alessandro Sciarra, Florian Dubost, Hannes Schnurre, Rupali Khatun, Chun-Chih Yu, Tsung-Lin Hsieh, Yi-Shan Tsai, Yi-Zeng Fang, Yung-Ching Yang, Juinn-Dar Huang, Marshall Xu, Siyu Liu, Fernanda L. Ribeiro, Saskia Bollmann, Karthikesh Varma Chintalapati, Chethan Mysuru Radhakrishna, Sri Chandana Hudukula Ram Kumara, Raviteja Sutrave, Abdul Qayyum, Moona Mazher, Imran Razzak, Cristobal Rodero, Steven Niederren, Fengming Lin, Yan Xia, Jiacheng Wang, Riyu Qiu, Liansheng Wang, Arya Yazdan Panah, Rosana El Jurdi, Guanghui Fu, Janan Arslan, Ghislain Vaillant, Romain Valabregue, Didier Dormont, Bruno Stankoff, Olivier Colliot, Luisa Vargas, Isai Daniel Chacón, Ioannis Pitsiorlas, Pablo Arbeláez, Maria A. Zuluaga, Stefanie Schreiber, Oliver Speck, Andreas Nürnberger
The human brain receives nutrients and oxygen through an intricate network of
blood vessels. Pathology affecting small vessels, at the mesoscopic scale,
represents a critical vulnerability within the cerebral blood supply and can
lead to severe conditions, such as Cerebral Small Vessel Diseases. The advent
of 7 Tesla MRI systems has enabled the acquisition of higher spatial resolution
images, making it possible to visualise such vessels in the brain. However, the
lack of publicly available annotated datasets has impeded the development of
robust, machine learning-driven segmentation algorithms. To address this, the
SMILE-UHURA challenge was organised. This challenge, held in conjunction with
ISBI 2023 in Cartagena de Indias, Colombia, aimed to provide a platform
for researchers working on related topics. The SMILE-UHURA challenge addresses
the gap in publicly available annotated datasets by providing an annotated
dataset of Time-of-Flight angiography acquired with 7T MRI. This dataset was
created through a combination of automated pre-segmentation and extensive
manual refinement. In this manuscript, sixteen submitted methods and two
baseline methods are compared both quantitatively and qualitatively on two
different datasets: held-out test MRAs from the same dataset as the training
data (with labels kept secret) and a separate 7T ToF MRA dataset where both
input volumes and labels are kept secret. The results demonstrate that most of
the submitted deep learning methods, trained on the provided training dataset,
achieved reliable segmentation performance. Dice scores reached up to 0.838
$\pm$ 0.066 and 0.716 $\pm$ 0.125 on the respective datasets, with an average
performance of up to 0.804 $\pm$ 0.15.
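For readers unfamiliar with the metric reported above, the Dice score between a predicted and a reference binary mask is 2|A ∩ B| / (|A| + |B|). The sketch below computes it on flattened masks; the function name, the empty-mask convention, and the toy masks are illustrative assumptions, not the challenge's evaluation code.

```python
def dice_score(pred, target):
    """Dice coefficient for two flattened binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        # Both masks empty: treated as perfect agreement (a common convention).
        return 1.0
    return 2.0 * intersection / size_sum

# Toy masks: 3 predicted voxels, 3 reference voxels, 2 in common.
pred_mask = [1, 1, 0, 0, 1, 0]
ref_mask = [1, 0, 0, 0, 1, 1]
score = dice_score(pred_mask, ref_mask)  # 2 * 2 / (3 + 3)
```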
☆ Adopting RAG for LLM-Aided Future Vehicle Design
In this paper, we explore the integration of Large Language Models (LLMs)
with Retrieval-Augmented Generation (RAG) to enhance automated design and
software development in the automotive industry. We present two case studies: a
standardization compliance chatbot and a design copilot, both utilizing RAG to
provide accurate, context-aware responses. We evaluate four LLMs (GPT-4o,
LLAMA3, Mistral, and Mixtral), comparing their answering accuracy and execution
time. Our results demonstrate that while GPT-4 offers superior performance,
LLAMA3 and Mistral also show promising capabilities for local deployment,
addressing data privacy concerns in automotive applications. This study
highlights the potential of RAG-augmented LLMs in improving design workflows
and compliance in automotive engineering.
comment: Conference paper accepted in IEEE FLLM 2024
☆ Software Performance Engineering for Foundation Model-Powered Software (FMware)
Haoxiang Zhang, Shi Chang, Arthur Leung, Kishanthan Thangarajah, Boyuan Chen, Hanan Lutfiyya, Ahmed E. Hassan
The rise of Foundation Models (FMs) like Large Language Models (LLMs) is
revolutionizing software development. Despite the impressive prototypes,
transforming FMware into production-ready products demands complex engineering
across various domains. A critical but overlooked aspect is performance
engineering, which aims at ensuring FMware meets performance goals such as
throughput and latency to avoid user dissatisfaction and financial loss. Often,
performance considerations are an afterthought, leading to costly optimization
efforts post-deployment. FMware's high computational resource demands highlight
the need for efficient hardware use. Continuous performance engineering is
essential to prevent degradation. This paper highlights the significance of
Software Performance Engineering (SPE) in FMware, identifying four key
challenges: cognitive architecture design, communication protocols, tuning and
optimization, and deployment. These challenges are based on literature surveys
and experiences from developing an in-house FMware system. We discuss problems,
current practices, and innovative paths for the software engineering community.
☆ Automating Reformulation of Essence Specifications via Graph Rewriting
Formulating an effective constraint model of a parameterised problem class is
crucial to the efficiency with which instances of the class can subsequently be
solved. It is difficult to know beforehand which of a set of candidate models
will perform best in practice. This paper presents a system that employs graph
rewriting to reformulate an input model for improved performance automatically.
By situating our work in the Essence abstract constraint specification
language, we can use the structure in its high level variable types to trigger
rewrites directly. We implement our system via rewrite rules expressed in the
Graph Programs 2 language, applied to the abstract syntax tree of an input
specification. We show how to automatically translate the solution of the
reformulated problem into a solution of the original problem for verification
and presentation. We demonstrate the efficacy of our system with a detailed
case study.
comment: Presented at the PTHG 2024 workshop
☆ Piecing It All Together: Verifying Multi-Hop Multimodal Claims
Existing claim verification datasets often do not require systems to perform
complex reasoning or effectively interpret multimodal evidence. To address
this, we introduce a new task: multi-hop multimodal claim verification. This
task challenges models to reason over multiple pieces of evidence from diverse
sources, including text, images, and tables, and determine whether the combined
multimodal evidence supports or refutes a given claim. To study this task, we
construct MMCV, a large-scale dataset comprising 16k multi-hop claims paired
with multimodal evidence, generated and refined using large language models,
with additional input from human feedback. We show that MMCV is challenging
even for the latest state-of-the-art multimodal large language models,
especially as the number of reasoning hops increases. Additionally, we
establish a human performance benchmark on a subset of MMCV. We hope this
dataset and its evaluation task will encourage future research in multimodal
multi-hop claim verification.
☆ OpenGeMM: A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling
Xiaoling Yi, Ryan Antonio, Joren Dumoulin, Jiacong Sun, Josse Van Delm, Guilherme Paim, Marian Verhelst
Deep neural networks (DNNs) face significant challenges when deployed on
resource-constrained extreme edge devices due to their computational and
data-intensive nature. While standalone accelerators tailored for specific
application scenarios suffer from inflexible control and limited
programmability, generic hardware acceleration platforms coupled with RISC-V
CPUs can enable high reusability and flexibility, yet typically at the expense
of system level efficiency and low utilization. To fill this gap, we propose
OpenGeMM, an open-source acceleration platform, jointly demonstrating high
efficiency and utilization, as well as ease of configurability and
programmability. OpenGeMM encompasses a parameterized Chisel-coded GeMM
accelerator, a lightweight RISC-V processor, and a tightly coupled multi-banked
scratchpad memory. The GeMM core utilization and system efficiency are boosted
through three mechanisms: configuration pre-loading, input pre-fetching with
output buffering, and programmable strided memory access. Experimental results
show that OpenGeMM can consistently achieve hardware utilization ranging from
81.89% to 99.34% across diverse CNN and Transformer workloads. Compared to the
SotA open-source Gemmini accelerator, OpenGeMM demonstrates a 3.58x to 16.40x
speedup on normalized throughput across a wide variety of GeMM workloads, while
achieving 4.68 TOPS/W system efficiency.
☆ Prompting the Unseen: Detecting Hidden Backdoors in Black-Box Models
Visual prompting (VP) is a new technique that adapts well-trained frozen
models for source domain tasks to target domain tasks. This study examines VP's
benefits for black-box model-level backdoor detection. The visual prompt in VP
maps class subspaces between source and target domains. We identify a
misalignment, termed class subspace inconsistency, between clean and poisoned
datasets. Based on this, we introduce \textsc{BProm}, a black-box model-level
detection method to identify backdoors in suspicious models, if any.
\textsc{BProm} leverages the low classification accuracy of prompted models
when backdoors are present. Extensive experiments confirm \textsc{BProm}'s
effectiveness.
☆ Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents
Yuyou Gan, Yong Yang, Zhe Ma, Ping He, Rui Zeng, Yiming Wang, Qingming Li, Chunyi Zhou, Songze Li, Ting Wang, Yunjun Gao, Yingcai Wu, Shouling Ji
With the continuous development of large language models (LLMs),
transformer-based models have made groundbreaking advances in numerous natural
language processing (NLP) tasks, leading to the emergence of a series of agents
that use LLMs as their control hub. While LLMs have achieved success in various
tasks, they face numerous security and privacy threats, which become even more
severe in the agent scenarios. To enhance the reliability of LLM-based
applications, a range of research has emerged to assess and mitigate these
risks from different perspectives.
To help researchers gain a comprehensive understanding of various risks, this
survey collects and analyzes the different threats faced by these agents. To
address the challenges posed by previous taxonomies in handling cross-module
and cross-stage threats, we propose a novel taxonomy framework based on the
sources and impacts. Additionally, we identify six key features of LLM-based
agents, based on which we summarize the current research progress and analyze
their limitations. Subsequently, we select four representative agents as case
studies to analyze the risks they may face in practical use. Finally, based on
the aforementioned analyses, we propose future research directions from the
perspectives of data, methodology, and policy, respectively.
☆ Communication Compression for Tensor Parallel LLM Inference
Large Language Models (LLMs) have pushed the frontier of artificial
intelligence but are comprised of hundreds of billions of parameters and
operations. For faster inference latency, LLMs are deployed on multiple
hardware accelerators through various Model Parallelism strategies. Our paper
looks into the details of one such strategy, Tensor Parallel, and proposes to
reduce latency by compressing inter-accelerator communication. We leverage
fine-grained quantization techniques to compress selected activations by 3.5 - 4.5x.
Our proposed method leads to up to a 2x reduction of time-to-first-token (TTFT) with
negligible model performance degradation.
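As a rough picture of fine-grained quantization, the sketch below quantizes activations in small blocks, each with its own scale, and computes the resulting compression ratio. The block size of 32, signed 4-bit codes, 16-bit source values, and 16-bit per-block scales are all assumptions chosen for illustration; the abstract does not specify the scheme, and whether its reported ratios come from settings like these is not stated.

```python
def quantize_block(values, bits=4):
    """Symmetric quantization of one block with a shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def quantize(values, block_size=32, bits=4):
    """Split activations into blocks and quantize each independently."""
    return [
        quantize_block(values[i:i + block_size], bits)
        for i in range(0, len(values), block_size)
    ]

def dequantize(blocks):
    """Reconstruct approximate activations from (codes, scale) pairs."""
    out = []
    for codes, scale in blocks:
        out.extend(c * scale for c in codes)
    return out

def compression_ratio(block_size=32, bits=4, source_bits=16, scale_bits=16):
    """Bits before compression divided by bits after, per block."""
    return (block_size * source_bits) / (block_size * bits + scale_bits)

# Deterministic toy activations in place of real tensor-parallel traffic.
activations = [((i * 37) % 101 - 50) / 10.0 for i in range(64)]
blocks = quantize(activations)
restored = dequantize(blocks)
max_err = max(abs(a - b) for a, b in zip(activations, restored))
max_scale = max(scale for _, scale in blocks)
ratio = compression_ratio()  # 512 / 144 under the assumed settings
```

Smaller blocks track local activation ranges more closely but spend more bits on scales, so the ratio in this sketch moves with `block_size` and `bits`.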
☆ Toward a Cohesive AI and Simulation Software Ecosystem for Scientific Innovation
In this paper, we discuss the need for an integrated software stack that
unites artificial intelligence (AI) and modeling and simulation (ModSim) tools
to advance scientific discovery. The authors advocate for a unified AI/ModSim
software ecosystem that ensures compatibility across a wide range of software
on diverse high-performance computing systems, promoting ease of deployment,
version management, and binary distribution. Key challenges highlighted include
balancing the distinct needs of AI and ModSim, especially in terms of software
build practices, dependency management, and compatibility. The document
underscores the importance of continuous integration, community-driven
stewardship, and collaboration with the Department of Energy (DOE) to develop a
portable and cohesive scientific software ecosystem. Recommendations focus on
supporting standardized environments through initiatives like the Extreme-scale
Scientific Software Stack (E4S) and Spack to foster interdisciplinary
innovation and facilitate new scientific advancements.
comment: 5 pages
☆ MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs
Large language models (LLMs) excel in high-resource languages but face
notable challenges in low-resource languages like Mongolian. This paper
addresses these challenges by categorizing capabilities into language abilities
(syntax and semantics) and cognitive abilities (knowledge and reasoning). To
systematically evaluate these areas, we developed MM-Eval, a specialized
dataset based on Modern Mongolian Language Textbook I and enriched with WebQSP
and MGSM datasets.
Preliminary experiments on models including Qwen2-7B-Instruct, GLM4-9b-chat,
Llama3.1-8B-Instruct, GPT-4, and DeepseekV2.5 revealed that: 1) all models
performed better on syntactic tasks than semantic tasks, highlighting a gap in
deeper language understanding; and 2) knowledge tasks showed a moderate
decline, suggesting that models can transfer general knowledge from
high-resource to low-resource contexts.
The release of MM-Eval, comprising 569 syntax, 677 semantics, 344 knowledge,
and 250 reasoning tasks, offers valuable insights for advancing NLP and LLMs in
low-resource languages like Mongolian. The dataset is available at
https://github.com/joenahm/MM-Eval.
☆ ResidualDroppath: Enhancing Feature Reuse over Residual Connections
Residual connections are one of the most important components in neural
network architectures for mitigating the vanishing gradient problem and
facilitating the training of much deeper networks. One possible explanation for
how residual connections aid deeper network training is by promoting feature
reuse. However, we identify and analyze the limitations of feature reuse with
vanilla residual connections. To address these limitations, we propose
modifications in training methods. Specifically, we provide an additional
opportunity for the model to learn feature reuse with residual connections
through two types of iterations during training. The first type of iteration
involves using droppath, which enforces feature reuse by randomly dropping a
subset of layers. The second type of iteration focuses on training the dropped
parts of the model while freezing the undropped parts. As a result, the dropped
parts learn in a way that encourages feature reuse, as the model relies on the
undropped parts with feature reuse in mind. Overall, we demonstrated
performance improvements in models with residual connections for image
classification in certain cases.
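The two iteration types described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names (`sample_drop_mask`, `residual_forward`, `trainable_layers`) and the toy scalar "layers" are assumptions, and the rule that the first iteration type updates only the kept layers is likewise assumed rather than stated in the abstract.

```python
import random

def sample_drop_mask(num_layers, drop_prob, rng):
    """Return one boolean per layer; True means the layer is dropped."""
    return [rng.random() < drop_prob for _ in range(num_layers)]

def residual_forward(x, layers, drop_mask):
    """Apply x <- x + f(x) for each kept layer.

    A dropped layer is skipped, so only the identity (residual) path
    carries x past it.
    """
    for f, dropped in zip(layers, drop_mask):
        if not dropped:
            x = x + f(x)
    return x

def trainable_layers(drop_mask, iteration_type):
    """Hypothetical schedule (an assumption, not taken from the paper).

    Type 1: update the kept layers while droppath is active.
    Type 2: update only the previously dropped layers; the rest are frozen.
    """
    if iteration_type == 1:
        return [i for i, dropped in enumerate(drop_mask) if not dropped]
    return [i for i, dropped in enumerate(drop_mask) if dropped]

# Toy scalar "layers" standing in for residual branches.
layers = [lambda v: 0.5 * v, lambda v: v + 1.0, lambda v: -0.25 * v]
mask = [False, True, False]  # the middle layer is dropped
out = residual_forward(2.0, layers, mask)
```

In a real network the layers would be residual branches of a deep model and the freezing would be done through the optimizer or gradient flags; the sketch only shows which parts are skipped and which are updated in each iteration type.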
☆ Renal Cell Carcinoma subtyping: learning from multi-resolution localization
Renal Cell Carcinoma is typically asymptomatic at the early stages for many
patients. This leads to a late diagnosis of the tumor, where the curability
likelihood is lower, and makes the mortality rate of Renal Cell Carcinoma high,
with respect to its incidence rate. To increase the survival chance, a fast and
correct categorization of the tumor subtype is paramount. Nowadays,
computerized methods, based on artificial intelligence, represent an
interesting opportunity to improve the productivity and the objectivity of the
microscopy-based Renal Cell Carcinoma diagnosis. Nonetheless, much of their
exploitation is hampered by the paucity of annotated datasets, essential for a
proficient training of supervised machine learning technologies. This study
sets out to investigate a novel self-supervised training strategy for machine
learning diagnostic tools, based on the multi-resolution nature of the
histological samples. We aim at reducing the need for annotated datasets, without
significantly reducing the accuracy of the tool. We demonstrate the
classification capability of our tool on a whole slide imaging dataset for
Renal Cancer subtyping, and we compare our solution with several
state-of-the-art classification counterparts.
☆ An Explainable Attention Model for Cervical Precancer Risk Classification using Colposcopic Images
Cervical cancer remains a major worldwide health issue, with early
identification and risk assessment playing critical roles in effective
preventive interventions. This paper presents the Cervix-AID-Net model for
cervical precancer risk classification. The study designs and evaluates the
proposed Cervix-AID-Net model based on patients' colposcopy images. The model
comprises a Convolutional Block Attention Module (CBAM) and convolutional
layers that extract interpretable and representative features of colposcopic
images to distinguish high-risk and low-risk cervical precancer. In addition,
the proposed Cervix-AID-Net model integrates four explainable techniques,
namely gradient class activation maps, Local Interpretable Model-agnostic
Explanations, CartoonX, and pixel rate distortion explanation based on output
feature maps and input features. The evaluation using holdout and ten-fold
cross-validation techniques yielded classification accuracies of 99.33% and
99.81%, respectively. The analysis revealed that CartoonX provides meticulous explanations
for the decision of the Cervix-AID-Net model due to its ability to provide the
relevant piece-wise smooth part of the image. The effect of Gaussian noise and
blur on the input shows that the performance remains unchanged up to Gaussian
noise of 3% and blur of 10%, while the performance reduces thereafter. A
comparison study of the proposed model's performance compared to other deep
learning approaches highlights the Cervix-AID-Net model's potential as a
supplemental tool for increasing the effectiveness of cervical precancer risk
assessment. The proposed method, which integrates the CBAM with explainable
artificial intelligence techniques, has the potential to influence cervical cancer
prevention and early detection, improving patient outcomes and lowering the
worldwide burden of this preventable disease.
comment: 19 pages, 9 figures, and 7 tables
☆ DiffRoad: Realistic and Diverse Road Scenario Generation for Autonomous Vehicle Testing
Generating realistic and diverse road scenarios is essential for autonomous
vehicle testing and validation. Nevertheless, owing to the complexity and
variability of real-world road environments, creating authentic and varied
scenarios for intelligent driving testing is challenging. In this paper, we
propose DiffRoad, a novel diffusion model designed to produce controllable and
high-fidelity 3D road scenarios. DiffRoad leverages the generative capabilities
of diffusion models to synthesize road layouts from white noise through an
inverse denoising process, preserving real-world spatial features. To enhance
the quality of generated scenarios, we design the Road-UNet architecture,
optimizing the balance between backbone and skip connections for high-realism
scenario generation. Furthermore, we introduce a road scenario evaluation
module that screens adequate and reasonable scenarios for intelligent driving
testing using two critical metrics: road continuity and road reasonableness.
Experimental results on multiple real-world datasets demonstrate DiffRoad's
ability to generate realistic and smooth road structures while maintaining the
original distribution. Additionally, the generated scenarios can be fully
automated into the OpenDRIVE format, facilitating generalized autonomous
vehicle simulation testing. DiffRoad provides a rich and diverse scenario
library for large-scale autonomous vehicle testing and offers valuable insights
for future infrastructure designs that are better suited for autonomous
vehicles.
comment: 14 pages, 9 figures
☆ AI-driven inverse design of materials: Past, present and future
Xiao-Qi Han, Xin-De Wang, Meng-Yuan Xu, Zhen Feng, Bo-Wen Yao, Peng-Jie Guo, Ze-Feng Gao, Zhong-Yi Lu
The discovery of advanced materials is the cornerstone of human technological
development and progress. The structures of materials and their corresponding
properties are essentially the result of a complex interplay of multiple
degrees of freedom such as lattice, charge, spin, symmetry, and topology. This
poses significant challenges for the inverse design methods of materials.
Humans have long explored new materials through a large number of experiments
and proposed corresponding theoretical systems to predict new material
properties and structures. With the improvement of computational power,
researchers have gradually developed various electronic structure calculation
methods, particularly those based on density functional theory, as well
as high-throughput computational methods. Recently, the rapid development of
artificial intelligence technology in the field of computer science has enabled
the effective characterization of the implicit association between material
properties and structures, thus opening up an efficient paradigm for the
inverse design of functional materials. Significant progress has been made in
the inverse design of materials based on generative and discriminative models,
attracting widespread attention from researchers. Considering this rapid
technological progress, in this survey, we look back on the latest advancements
in AI-driven inverse design of materials by introducing the background, key
findings, and mainstream technological development routes. In addition, we
summarize the remaining issues for future directions. This survey provides the
latest overview of AI-driven inverse design of materials, which can serve as a
useful resource for researchers.
comment: 43 pages, 5 figures, 2 tables
☆ An Adaptive Open-Source Dataset Generation Framework for Machine Learning Tasks in Logic Synthesis
Liwei Ni, Rui Wang, Miao Liu, Xingyu Meng, Xiaoze Lin, Junfeng Liu, Guojie Luo, Zhufei Chu, Weikang Qian, Xiaoyan Yang, Biwei Xie, Xingquan Li, Huawei Li
This paper introduces an adaptive logic synthesis dataset generation
framework designed to enhance machine learning applications within the logic
synthesis process. Unlike previous dataset generation flows that were tailored
for specific tasks or lacked integrated machine learning capabilities, the
proposed framework supports a comprehensive range of machine learning tasks by
encapsulating the three fundamental steps of logic synthesis: Boolean
representation, logic optimization, and technology mapping. It preserves the
original information in the intermediate files that can be stored in both
Verilog and GraphML formats. Verilog files enable semi-customizability,
allowing researchers to add steps and incrementally refine the generated
dataset. The framework also includes an adaptive circuit engine to facilitate
the loading of GraphML files for final dataset packaging and sub-dataset
extraction. The generated OpenLS-D dataset comprises 46 combinational designs
from established benchmarks, totaling over 966,000 Boolean circuits, with each
design containing 21,000 circuits generated from 1000 synthesis recipes,
including 7000 Boolean networks, 7000 ASIC netlists, and 7000 FPGA netlists.
Furthermore, OpenLS-D supports integrating newly desired data features, making
it more versatile for new challenges. The utility of OpenLS-D is demonstrated
through four distinct downstream tasks: circuit classification, circuit
ranking, quality of results (QoR) prediction, and probability prediction. Each
task highlights different internal steps of logic synthesis, with the datasets
extracted and relabeled from the OpenLS-D dataset using the circuit engine. The
experimental results confirm the dataset's diversity and extensive
applicability. The source code and datasets are available at
https://github.com/Logic-Factory/ACE/blob/master/OpenLS-D/readme.md.
comment: 14 pages
☆ SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers
Image classification is a computer vision task where a model analyzes an
image to categorize it into a specific label. Vision Transformers (ViT) improve
this task by leveraging self-attention to capture complex patterns and
long-range relationships between image patches. However, a key challenge for ViTs is
efficiently incorporating multiscale feature representations, which are inherent
in CNNs through their hierarchical structure. In this paper, we introduce the
Scale-Aware Graph Attention Vision Transformer (SAG-ViT), a novel framework
that addresses this challenge by integrating multi-scale features. Using
EfficientNet as a backbone, the model extracts multi-scale feature maps, which
are divided into patches to preserve semantic information. These patches are
organized into a graph based on spatial and feature similarities, with a Graph
Attention Network (GAT) refining the node embeddings. Finally, a Transformer
encoder captures long-range dependencies and complex interactions. The SAG-ViT
is evaluated on benchmark datasets, demonstrating its effectiveness in
enhancing image classification performance.
comment: 10 pages, 4 figures, 3 tables
☆ Script-centric behavior understanding for assisted autism spectrum disorder diagnosis ICASSP 2025
Observing and analyzing children's social behaviors is crucial for the early
diagnosis of Autism Spectrum Disorders (ASD). This work focuses on
automatically detecting ASD using computer vision techniques and large language
models (LLMs). Existing methods typically rely on supervised learning. However,
the scarcity of ASD diagnostic datasets and the lack of interpretability in
diagnostic results significantly limit their clinical application. To address
these challenges, we introduce a novel unsupervised approach based on
script-centric behavior understanding. Our pipeline converts video content into
scripts that describe the behavior of characters, leveraging the
generalizability of large language models to detect ASD in a zero-shot or
few-shot manner. Specifically, we propose a script transcription module for
multimodal behavior data textualization and a domain prompts module to bridge
LLMs. Our method achieves an accuracy of 92.00% in diagnosing ASD in children
with an average age of 24 months, surpassing the performance of supervised
learning methods by an absolute 3.58%. Extensive experiments confirm the
effectiveness of our approach and suggest its potential for advancing ASD
research through LLMs.
comment: 5 pages, 4 figures, submitted to ICASSP 2025
☆ Quantum Machine Learning: An Interplay Between Quantum Computing and Machine Learning
Quantum machine learning (QML) is a rapidly growing field that combines
quantum computing principles with traditional machine learning. It seeks to
revolutionize machine learning by harnessing the unique capabilities of quantum
mechanics and employs machine learning techniques to advance quantum computing
research. This paper introduces quantum computing for the machine learning
paradigm, where variational quantum circuits (VQC) are used to develop QML
architectures on noisy intermediate-scale quantum (NISQ) devices. We discuss
machine learning for the quantum computing paradigm, showcasing our recent
theoretical and empirical findings. In particular, we delve into future
directions for studying QML, exploring the potential industrial impacts of QML
research.
comment: In submission
☆ Automated Segmentation of Ischemic Stroke Lesions in Non-Contrast Computed Tomography Images for Enhanced Treatment and Prognosis MICCAI
Stroke is the second leading cause of death worldwide, and is increasingly
prevalent in low- and middle-income countries (LMICs). Timely interventions can
significantly influence stroke survivability and the quality of life after
treatment. However, the standard and most widely available imaging method for
confirming strokes and their sub-types, the NCCT, is more challenging and
time-consuming to employ in cases of ischemic stroke. For this reason, we
developed an automated method for ischemic stroke lesion segmentation in NCCTs
using the nnU-Net framework, aimed at enhancing early treatment and improving
the prognosis of ischemic stroke patients. We achieved a Dice score of 0.596 and
an Intersection over Union (IoU) score of 0.501 on the sampled dataset. After
adjusting for outliers, these scores improved to 0.752 for the Dice score and
0.643 for the IoU. Proper delineation of the region of infarction can help
clinicians better assess the potential impact of the infarction, and guide
treatment procedures.
comment: 7 pages, 3 figures, MICCAI Meets Africa Workshop
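For readers unfamiliar with the two overlap metrics reported above, the following is a minimal sketch of how Dice and IoU are computed for a single pair of binary masks. The function name `dice_and_iou`, the flat 0/1 list representation, and the convention for two empty masks are assumptions made for illustration; they are not taken from the paper or from nnU-Net.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two equally sized binary masks (flat 0/1 sequences)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - intersection
    if union == 0:
        # Convention assumed here: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou

# Toy example: 3 predicted voxels, 3 true voxels, 2 of them shared.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU); scores averaged over a dataset, like those quoted in the abstract, need not satisfy this identity.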
☆ Imagined Speech and Visual Imagery as Intuitive Paradigms for Brain-Computer Interfaces
Recent advancements in brain-computer interface (BCI) technology have
emphasized the promise of imagined speech and visual imagery as effective
paradigms for intuitive communication. This study investigates the
classification performance and brain connectivity patterns associated with
these paradigms, focusing on decoding accuracy across selected word classes.
Sixteen participants engaged in tasks involving thirteen imagined speech and
visual imagery classes, revealing above-chance classification accuracy for both
paradigms. Variability in classification accuracy across individual classes
highlights the influence of sensory and motor associations in imagined speech
and vivid visual associations in visual imagery. Connectivity analysis further
demonstrated increased functional connectivity in language-related and sensory
regions for imagined speech, whereas visual imagery activated spatial and
visual processing networks. These findings suggest the potential of imagined
speech and visual imagery as an intuitive and scalable paradigm for BCI
communication when selecting optimal word classes. Further exploration of the
decoding outcomes for these two paradigms could provide insights for practical
BCI communication.
comment: 4 pages
☆ Less is More: Unseen Domain Fake News Detection via Causal Propagation Substructures
The spread of fake news on social media poses significant threats to
individuals and society. Text-based and graph-based models have been employed
for fake news detection by analysing news content and propagation networks,
showing promising results in specific scenarios. However, these data-driven
models heavily rely on pre-existing in-distribution data for training, limiting
their performance when confronted with fake news from emerging or previously
unseen domains, known as out-of-distribution (OOD) data. Tackling OOD fake news
is a challenging yet critical task. In this paper, we introduce the Causal
Subgraph-oriented Domain Adaptive Fake News Detection (CSDA) model, designed to
enhance zero-shot fake news detection by extracting causal substructures from
propagation graphs using in-distribution data and generalising this approach to
OOD data. The model employs a graph neural network-based mask generation
process to identify dominant nodes and edges within the propagation graph,
using these substructures for fake news detection. Additionally, the
performance of CSDA is further improved through contrastive learning in
few-shot scenarios, where a limited amount of OOD data is available for
training. Extensive experiments on public social media datasets demonstrate
that CSDA effectively handles OOD fake news detection, achieving a 7 to 16
percent accuracy improvement over other state-of-the-art models.
comment: 9 pages, 2 figures, 5 tables
☆ LTLf+ and PPLTL+: Extending LTLf and PPLTL to Infinite Traces
We introduce LTLf+ and PPLTL+, two logics to express properties of infinite
traces, that are based on the linear-time temporal logics LTLf and PPLTL on
finite traces. LTLf+/PPLTL+ use levels of Manna and Pnueli's LTL
safety-progress hierarchy, and thus have the same expressive power as LTL.
However, they also retain a crucial characteristic of the reactive synthesis
problem for the base logics: the game arena for strategy extraction can be
derived from deterministic finite automata (DFA). Consequently, these logics
circumvent the notorious difficulties associated with determinizing infinite
trace automata, typical of LTL reactive synthesis. We present DFA-based
synthesis techniques for LTLf+/PPLTL+, and show that synthesis is
2EXPTIME-complete for LTLf+ (matching LTLf) and EXPTIME-complete for PPLTL+
(matching PPLTL). Notably, while PPLTL+ retains the full expressive power of
LTL, reactive synthesis is EXPTIME-complete instead of 2EXPTIME-complete. The
techniques are also adapted to optimally solve satisfiability, validity, and
model-checking, to get EXPSPACE-complete for LTLf+ (extending a recent result
for the guarantee level using LTLf), and PSPACE-complete for PPLTL+.
☆ Your Fixed Watermark is Fragile: Towards Semantic-Aware Watermark for EaaS Copyright Protection
Embedding-as-a-Service (EaaS) has emerged as a successful business pattern
but faces significant challenges related to various forms of copyright
infringement, including API misuse and different attacks. Various studies have
proposed backdoor-based watermarking schemes to protect the copyright of EaaS
services. In this paper, we reveal that previous watermarking schemes possess
semantic-independent characteristics and propose the Semantic Perturbation
Attack (SPA). Our theoretical and experimental analyses demonstrate that this
semantic-independent nature makes current watermarking schemes vulnerable to
adaptive attacks that exploit semantic perturbation tests to bypass watermark
verification. To address this vulnerability, we propose the Semantic Aware
Watermarking (SAW) scheme, a robust defense mechanism designed to resist SPA,
by injecting a watermark that adapts to the text semantics. Extensive
experimental results across multiple datasets demonstrate that the True
Positive Rate (TPR) for detecting watermarked samples under SPA can exceed
95%, rendering previous watermarks ineffective. Meanwhile, our
watermarking scheme can resist such attacks while ensuring the watermark
verification capability. Our code is available at
https://github.com/Zk4-ps/EaaS-Embedding-Watermark.
☆ Multi-scale Generative Modeling for Fast Sampling
Xiongye Xiao, Shixuan Li, Luzhe Huang, Gengshuo Liu, Trung-Kien Nguyen, Yi Huang, Di Chang, Mykel J. Kochenderfer, Paul Bogdan
While working within the spatial domain can pose problems associated with
ill-conditioned scores caused by power-law decay, recent advances in
diffusion-based generative models have shown that transitioning to the wavelet
domain offers a promising alternative. However, within the wavelet domain, we
encounter unique challenges, especially the sparse representation of
high-frequency coefficients, which deviates significantly from the Gaussian
assumptions in the diffusion process. To this end, we propose a multi-scale
generative modeling approach in the wavelet domain that employs distinct strategies for
handling low and high-frequency bands. In the wavelet domain, we apply
score-based generative modeling with well-conditioned scores for low-frequency
bands, while utilizing multi-scale generative adversarial learning for
high-frequency bands. As supported by the theoretical analysis and experimental
results, our model significantly improves performance and reduces the number of
trainable parameters, sampling steps, and time.
☆ EEG-Based Speech Decoding: A Novel Approach Using Multi-Kernel Ensemble Diffusion Models
In this study, we propose an ensemble learning framework for
electroencephalogram-based overt speech classification, leveraging denoising
diffusion probabilistic models with varying convolutional kernel sizes. The
ensemble comprises three models with kernel sizes of 51, 101, and 201,
effectively capturing multi-scale temporal features inherent in signals. This
approach improves the robustness and accuracy of speech decoding by
accommodating the rich temporal complexity of neural signals. The ensemble
models work in conjunction with conditional autoencoders that refine the
reconstructed signals and maximize the useful information for downstream
classification tasks. The results indicate that the proposed ensemble-based
approach significantly outperforms individual models and existing
state-of-the-art techniques. These findings demonstrate the potential of
ensemble methods in advancing brain signal decoding, offering new possibilities
for non-verbal communication applications, particularly in brain-computer
interface systems aimed at aiding individuals with speech impairments.
☆ Learning Hand State Estimation for a Light Exoskeleton
We propose a machine learning-based estimator of the hand state for
rehabilitation purposes, using light exoskeletons. These devices are easy to
use and useful for delivering domestic and frequent therapies. We build a
supervised approach using information from the muscular activity of the forearm
and the motion of the exoskeleton to reconstruct the hand's opening degree and
compliance level. Such information can be used to evaluate the therapy progress
and develop adaptive control behaviors. Our approach is validated with a real
light exoskeleton. The experiments demonstrate good predictive performance of
our approach when trained on data coming from a single user and tested on the
same user, even across different sessions. This generalization capability makes
our system promising for practical use in real rehabilitation.
☆ StreamAdapter: Efficient Test Time Adaptation from Contextual Streams
Dilxat Muhtar, Yelong Shen, Yaming Yang, Xiaodong Liu, Yadong Lu, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Xueliang Zhang, Jianfeng Gao, Weizhu Chen, Qi Zhang
In-context learning (ICL) allows large language models (LLMs) to adapt to new
tasks directly from the given demonstrations without requiring gradient
updates. While recent advances have expanded context windows to accommodate
more demonstrations, this approach increases inference costs without
necessarily improving performance. To mitigate these issues, we propose
StreamAdapter, a novel approach that directly updates model parameters from
context at test time, eliminating the need for explicit in-context
demonstrations. StreamAdapter employs context mapping and weight absorption
mechanisms to dynamically transform ICL demonstrations into parameter updates
with minimal additional parameters. By reducing reliance on numerous in-context
examples, StreamAdapter significantly reduces inference costs and allows for
efficient inference with constant time complexity, regardless of demonstration
count. Extensive experiments across diverse tasks and model architectures
demonstrate that StreamAdapter achieves comparable or superior adaptation
capability to ICL while requiring significantly fewer demonstrations. The
superior task adaptation and context encoding capabilities of StreamAdapter on
both language understanding and generation tasks provide a new perspective for
adapting LLMs at test time using context, allowing for more efficient
adaptation across scenarios and more cost-effective inference.
comment: 22 Pages, 9 Figures
☆ Cross-Modal Consistency in Multimodal Large Language Models
Xiang Zhang, Senyu Li, Ning Shi, Bradley Hauer, Zijun Wu, Grzegorz Kondrak, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan
Recent developments in multimodal methodologies have marked the beginning of
an exciting era for models adept at processing diverse data types, encompassing
text, audio, and visual content. Models like GPT-4V, which merge computer
vision with advanced language processing, exhibit extraordinary proficiency in
handling intricate tasks that require a simultaneous understanding of both
textual and visual information. Prior research efforts have meticulously
evaluated the efficacy of these Vision Large Language Models (VLLMs) in various
domains, including object detection, image captioning, and other related
fields. However, existing analyses have often suffered from limitations,
primarily centering on the isolated evaluation of each modality's performance
while neglecting to explore their intricate cross-modal interactions.
Specifically, the question of whether these models achieve the same level of
accuracy when confronted with identical task instances across different
modalities remains unanswered. In this study, we take the initiative to delve
into the interaction and comparison among these modalities of interest by
introducing a novel concept termed cross-modal consistency. Furthermore, we
propose a quantitative evaluation framework founded on this concept. Our
experimental findings, drawn from a curated collection of parallel
vision-language datasets developed by us, unveil a pronounced inconsistency
between the vision and language modalities within GPT-4V, despite its portrayal
as a unified multimodal model. Our research yields insights into the
appropriate utilization of such models and hints at potential avenues for
enhancing their design.
☆ Harnessing multiple LLMs for Information Retrieval: A case study on Deep Learning methodologies in Biodiversity publications
Deep Learning (DL) techniques are increasingly applied in scientific studies
across various domains to address complex research questions. However, the
methodological details of these DL models are often hidden in the unstructured
text. As a result, critical information about how these models are designed,
trained, and evaluated is challenging to access and comprehend. To address this
issue, in this work, we use five different open-source Large Language Models
(LLMs): Llama-3 70B, Llama-3.1 70B, Mixtral-8x22B-Instruct-v0.1, Mixtral 8x7B,
and Gemma 2 9B in combination with a Retrieval-Augmented Generation (RAG)
approach to extract and process DL methodological details from scientific
publications automatically. We built a voting classifier from the outputs of
five LLMs to accurately report DL methodological information. We tested our
approach using biodiversity publications, building upon our previous research.
To validate our pipeline, we employed two datasets of DL-related biodiversity
publications: a curated set of 100 publications from our prior work and a set
of 364 publications from the Ecological Informatics journal. Our results
demonstrate that the multi-LLM, RAG-assisted pipeline enhances the retrieval of
DL methodological information, achieving an accuracy of 69.5% (417 out of 600
comparisons) based solely on textual content from publications. This
performance was assessed against human annotators who had access to code,
figures, tables, and other supplementary information. Although demonstrated in
biodiversity, our methodology is not limited to this field; it can be applied
across other scientific domains where detailed methodological reporting is
essential for advancing knowledge and ensuring reproducibility. This study
presents a scalable and reliable approach for automating information
extraction, facilitating better reproducibility and knowledge transfer across
studies.
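The voting step and the reported accuracy figure can be illustrated with a small sketch. It assumes a simple majority vote with first-seen tie-breaking, which the abstract does not specify; `majority_vote` and `accuracy` are hypothetical helper names, and the example votes are made up. Only the counts 417 and 600 come from the abstract.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer

def accuracy(num_correct, num_total):
    """Fraction of comparisons judged correct."""
    return num_correct / num_total

# Hypothetical outputs from five LLMs for one yes/no methodological question.
votes = ["yes", "no", "yes", "yes", "no"]
winner = majority_vote(votes)

# Counts quoted in the abstract: 417 correct out of 600 comparisons.
acc = accuracy(417, 600)
```

With five voters and a binary question a strict majority always exists; the tie-breaking rule only matters for questions with more than two possible answers.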
☆ How Good is ChatGPT at Audiovisual Deepfake Detection: A Comparative Study of ChatGPT, AI Models and Human Perception
Multimodal deepfakes involving audiovisual manipulations are a growing threat
because they are difficult to detect with the naked eye or using unimodal deep
learning-based forgery detection methods. Audiovisual forensic models, while
more capable than unimodal models, require large training datasets and are
computationally expensive for training and inference. Furthermore, these models
lack interpretability and often do not generalize well to unseen manipulations.
In this study, we examine the detection capabilities of a large language model
(LLM) (i.e., ChatGPT) to identify and account for any possible visual and
auditory artifacts and manipulations in audiovisual deepfake content. Extensive
experiments are conducted on videos from a benchmark multimodal deepfake
dataset to evaluate the detection performance of ChatGPT and compare it with
the detection capabilities of state-of-the-art multimodal forensic models and
humans. Experimental results demonstrate the importance of domain knowledge and
prompt engineering for video forgery detection tasks using LLMs. Unlike
approaches based on end-to-end learning, ChatGPT can account for spatial and
spatiotemporal artifacts and inconsistencies that may exist within or across
modalities. Additionally, we discuss the limitations of ChatGPT for multimedia
forensic tasks.
☆ Automating Autograding: Large Language Models as Test Suite Generators for Introductory Programming
Automatically graded programming assignments provide instant feedback to
students and significantly reduce manual grading time for instructors. However,
creating comprehensive suites of test cases for programming problems within
automatic graders can be time-consuming and complex. The effort needed to
define test suites may deter some instructors from creating additional problems
or lead to inadequate test coverage, potentially resulting in misleading
feedback on student solutions. Such limitations may reduce student access to
the well-documented benefits of timely feedback when learning programming.
In this work, we evaluate the effectiveness of using Large Language Models
(LLMs), as part of a larger workflow, to automatically generate test suites for
CS1-level programming problems. Each problem's statement and reference solution
are provided to GPT-4 to produce a test suite that can be used by an
autograder. We evaluate our proposed approach using a sample of 26 problems,
and more than 25,000 attempted solutions to those problems, submitted by
students in an introductory programming course. We compare the performance of
the LLM-generated test suites against the instructor-created test suites for
each problem. Our findings reveal that LLM-generated test suites can correctly
identify most valid solutions, and for most problems are at least as
comprehensive as the instructor test suites. Additionally, the LLM-generated
test suites exposed ambiguities in some problem statements, underscoring their
potential to improve both autograding and instructional design.
comment: Submitted to Journal of Computer Assisted Learning
☆ Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting
Predicting spatio-temporal traffic flow presents significant challenges due
to complex interactions between spatial and temporal factors. Existing
approaches often address these dimensions in isolation, neglecting their
critical interdependencies. In this paper, we introduce the Spatio-Temporal
Unitized Model (STUM), a unified framework designed to capture both spatial and
temporal dependencies while addressing spatio-temporal heterogeneity through
techniques such as distribution alignment and feature fusion. It also ensures
both predictive accuracy and computational efficiency. Central to STUM is the
Adaptive Spatio-temporal Unitized Cell (ASTUC), which utilizes low-rank
matrices to seamlessly store, update, and interact with space, time, as well as
their correlations. Our framework is also modular, allowing it to integrate
with various spatio-temporal graph neural networks through components such as
backbone models, feature extractors, residual fusion blocks, and predictive
modules to collectively enhance forecasting outcomes. Experimental results
across multiple real-world datasets demonstrate that STUM consistently improves
prediction performance with minimal computational cost. These findings are
further supported by hyperparameter optimization, pre-training analysis, and
result visualization. We provide our source code for reproducibility at
https://anonymous.4open.science/r/STUM-E4F0.
☆ Enhancing Financial Domain Adaptation of Language Models via Model Augmentation
The domain adaptation of language models, including large language models
(LLMs), has become increasingly important as the use of such models continues
to expand. This study demonstrates the effectiveness of Composition to Augment
Language Models (CALM) in adapting to the financial domain. CALM is a model to
extend the capabilities of existing models by introducing cross-attention
between two LLMs with different functions. In our experiments, we developed a
CALM to enhance the financial performance of an LLM with strong response
capabilities by leveraging a financial-specialized LLM. Notably, the CALM was
trained using a financial dataset different from the one used to train the
financial-specialized LLM, confirming CALM's ability to adapt to various
datasets. The models were evaluated through quantitative Japanese financial
benchmarks and qualitative response comparisons, demonstrating that CALM
enables superior responses with higher scores than the original models and
baselines. Additionally, comparative experiments on connection points revealed
that connecting the middle layers of the models is most effective in
facilitating adaptation to the financial domain. These findings confirm that
CALM is a practical approach for adapting LLMs to the financial domain.
☆ Towards Unified Neural Decoding of Perceived, Spoken and Imagined Speech from EEG Signals
Brain signals carry various information relevant to human actions and
mental imagery, making them crucial to interpreting and understanding human
intentions. Brain-computer interface technology leverages this brain activity
to generate external commands for controlling the environment, offering
critical advantages to individuals with paralysis or locked-in syndrome. Within
the brain-computer interface domain, brain-to-speech research has gained
attention, focusing on the direct synthesis of audible speech from brain
signals. Most current studies decode speech from brain activity using invasive
techniques and emphasize spoken speech data. However, humans express various
speech states, and distinguishing these states through non-invasive approaches
remains a significant yet challenging task. This research investigated the
effectiveness of deep learning models for non-invasive neural signal
decoding, with an emphasis on distinguishing between different speech
paradigms, including perceived, overt, whispered, and imagined speech, across
multiple frequency bands. The model utilizing the spatial convolutional neural
network module demonstrated superior performance compared to other models,
especially in the gamma band. Additionally, imagined speech in the theta
frequency band, where deep learning also showed strong effects, exhibited
statistically significant differences compared to the other speech paradigms.
☆ Programming with AI: Evaluating ChatGPT, Gemini, AlphaCode, and GitHub Copilot for Programmers
Our everyday lives now heavily rely on artificial intelligence (AI) powered
large language models (LLMs). Like regular users, programmers are also
benefiting from the newest large language models. In response to the critical
role that AI models play in modern software development, this study presents a
thorough evaluation of leading programming assistants, including ChatGPT,
Gemini (Bard AI), AlphaCode, and GitHub Copilot. The evaluation is based on
tasks like natural language processing and code generation accuracy in
different programming languages such as Java, Python, and C++. The results
highlight the assistants' strengths and weaknesses and the need for further
improvements to increase the reliability and accuracy of the latest popular
models. Although these AI assistants show considerable progress in language
understanding and code generation, they also raise ethical considerations and
questions of responsible usage that warrant discussion.
With time, developing more refined AI technology is essential for achieving
advanced solutions in various fields, especially with the knowledge of the
feature intricacies of these models and their implications. This study offers a
comparison of different LLMs and provides essential feedback on the rapidly
changing area of AI models. It also emphasizes the need for ethical
developmental practices to actualize AI models' full potential.
comment: 8 pages
☆ Transferable Adversarial Attacks against ASR SP
Given the extensive research and real-world applications of automatic speech
recognition (ASR), ensuring the robustness of ASR models against minor input
perturbations becomes a crucial consideration for maintaining their
effectiveness in real-time scenarios. Previous explorations into ASR model
robustness have predominantly revolved around evaluating accuracy on white-box
settings with full access to ASR models. Nevertheless, full ASR model details
are often not available in real-world applications. Therefore, evaluating the
robustness of black-box ASR models is essential for a comprehensive
understanding of ASR model resilience. In this regard, we thoroughly study the
vulnerability of cutting-edge ASR models to practical black-box attacks and
propose to employ two advanced time-domain-based transferable attacks alongside
our differentiable feature extractor. We also propose a speech-aware gradient
optimization approach (SAGO) for ASR, which forces mistranscription with
minimal impact on human imperceptibility through a voice activity detection rule
and a speech-aware gradient-oriented optimizer. Our comprehensive experimental
results reveal performance enhancements compared to baseline approaches across
five models on two databases.
comment: IEEE SPL
☆ Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering
Retrieval-augmented generation (RAG) has emerged as a promising approach to
enhance the performance of large language models (LLMs) in knowledge-intensive
tasks such as those from the medical domain. However, the sensitive nature of the
medical domain necessitates a completely accurate and trustworthy system. While
existing RAG benchmarks primarily focus on the standard retrieve-answer
setting, they overlook many practical scenarios that measure crucial aspects of
a reliable medical system. This paper addresses this gap by providing a
comprehensive evaluation framework for medical question-answering (QA) systems
in a RAG setting for these situations, including sufficiency, integration, and
robustness. We introduce Medical Retrieval-Augmented Generation Benchmark
(MedRGB) that provides various supplementary elements to four medical QA
datasets for testing LLMs' ability to handle these specific scenarios.
Utilizing MedRGB, we conduct extensive evaluations of both state-of-the-art
commercial LLMs and open-source models across multiple retrieval conditions.
Our experimental results reveal current models' limited ability to handle
noise and misinformation in the retrieved documents. We further analyze the
LLMs' reasoning processes to provide valuable insights and future directions
for developing RAG systems in this critical medical domain.
☆ Dynamic Neural Communication: Convergence of Computer Vision and Brain-Computer Interface
Interpreting human neural signals to decode static speech intentions such as
text or images and dynamic speech intentions such as audio or video is showing
great potential as an innovative communication tool. Human communication
involves various features, such as articulatory movements, facial
expressions, and internal speech, all of which are reflected in neural signals.
However, most studies only generate short or fragmented outputs, while
providing informative communication by leveraging various features from neural
signals remains challenging. In this study, we introduce a dynamic neural
communication method that leverages current computer vision and brain-computer
interface technologies. Our approach captures the user's intentions from neural
signals and decodes visemes in short time steps to produce dynamic visual
outputs. The results demonstrate the potential to rapidly capture and
reconstruct lip movements during natural speech attempts from human neural
signals, enabling dynamic neural communication through the convergence of
computer vision and brain-computer interface.
comment: 4 pages, 2 figures, 1 table, Name of Conference: International
Conference on Brain-Computer Interface
☆ RibCageImp: A Deep Learning Framework for 3D Ribcage Implant Generation
The recovery of damaged or resected ribcage structures requires precise,
custom-designed implants to restore the integrity and functionality of the
thoracic cavity. Traditional implant design methods rely mainly on manual
processes, making them time-consuming and susceptible to variability. In this
work, we explore the feasibility of automated ribcage implant generation using
deep learning. We present a framework based on a 3D U-Net architecture that
processes CT scans to generate patient-specific implant designs. To the best of
our knowledge, this is the first investigation into automated thoracic implant
generation using deep learning approaches. Our preliminary results, while
moderate, highlight both the potential and the significant challenges in this
complex domain. These findings establish a foundation for future research in
automated ribcage reconstruction and identify key technical challenges that
need to be addressed for practical implementation.
☆ Improvement and Implementation of a Speech Emotion Recognition Model Based on Dual-Layer LSTM
This paper builds upon an existing speech emotion recognition model by adding
an additional LSTM layer to improve the accuracy and processing efficiency of
emotion recognition from audio data. By capturing the long-term dependencies
within audio sequences through a dual-layer LSTM network, the model can
recognize and classify complex emotional patterns more accurately. Experiments
conducted on the RAVDESS dataset validated this approach, showing that the
modified dual-layer LSTM model improves accuracy by 2% compared to the
single-layer LSTM while significantly reducing recognition latency, thereby
enhancing real-time performance. These results indicate that the dual-layer
LSTM architecture is highly suitable for handling emotional features with
long-term dependencies, providing a viable optimization for speech emotion
recognition systems. This research provides a reference for practical
applications in fields like intelligent customer service, sentiment analysis
and human-computer interaction.
☆ Dynamic technology impact analysis: A multi-task learning approach to patent citation prediction
Machine learning (ML) models are valuable tools for analyzing the impact of
technology using patent citation information. However, existing ML-based
methods often struggle to account for the dynamic nature of the technology
impact over time and the interdependencies of these impacts across different
periods. This study proposes a multi-task learning (MTL) approach to enhance
the prediction of technology impact across various time frames by leveraging
knowledge sharing and simultaneously monitoring the evolution of technology
impact. First, we quantify the technology impacts and identify patterns through
citation analysis over distinct time periods. Next, we develop MTL models to
predict citation counts using multiple patent indicators over time. Finally, we
examine the changes in key input indicators and their patterns over different
periods using the SHapley Additive exPlanation method. We also offer guidelines
for validating and interpreting the results by employing statistical methods
and natural language processing techniques. A case study on battery
technologies demonstrates that our approach not only deepens the understanding
of technology impact, but also improves prediction accuracy, yielding valuable
insights for both academia and industry.
☆ DeBaTeR: Denoising Bipartite Temporal Graph for Recommendation
Due to the difficulty of acquiring large-scale explicit user feedback,
implicit feedback (e.g., clicks or other interactions) is widely applied as an
alternative source of data, where user-item interactions can be modeled as a
bipartite graph. Due to the noisy and biased nature of implicit real-world
user-item interactions, identifying and rectifying noisy interactions are vital
to enhance model performance and robustness. Previous works on purifying
user-item interactions in collaborative filtering mainly focus on mining the
correlation between user/item embeddings and noisy interactions, neglecting the
benefit of temporal patterns in determining noisy interactions. Time
information, while enhancing the model utility, also bears its natural
advantage in helping to determine noisy edges, e.g., if someone usually watches
horror movies at night and talk shows in the morning, a record of watching a
horror movie in the morning is more likely to be a noisy interaction. Armed with
this observation, we introduce a simple yet effective mechanism for generating
time-aware user/item embeddings and propose two strategies for denoising
bipartite temporal graph in recommender systems (DeBaTeR): the first is through
reweighting the adjacency matrix (DeBaTeR-A), where a reliability score is
defined to reweight the edges through both soft assignment and hard assignment;
the second is through reweighting the loss function (DeBaTeR-L), where weights
are generated to reweight user-item samples in the losses. Extensive
experiments have been conducted to demonstrate the efficacy of our methods and
illustrate how time information indeed helps identify noisy edges.
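The two denoising strategies named above can be pictured with a small sketch. It assumes the per-edge reliability scores are already given as numbers in [0, 1]; how DeBaTeR derives them from time-aware embeddings is not reproduced here. The function names, the 0.5 threshold, and the example edges are hypothetical.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]  # (user, item)


def reweight_edges(reliability: Dict[Edge, float],
                   hard: bool = False,
                   threshold: float = 0.5) -> Dict[Edge, float]:
    """Turn per-edge reliability scores into adjacency weights.

    Soft assignment keeps every edge and scales it by its score.
    Hard assignment keeps an edge (weight 1.0) only if its score
    reaches the threshold, and drops it otherwise.
    """
    if hard:
        return {e: 1.0 for e, r in reliability.items() if r >= threshold}
    return dict(reliability)


def weighted_loss(losses: Dict[Edge, float],
                  weights: Dict[Edge, float]) -> float:
    """Weighted mean of per-sample losses; edges without a weight count as 0."""
    total_weight = sum(weights.get(e, 0.0) for e in losses)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights.get(e, 0.0) * loss for e, loss in losses.items())
    return weighted_sum / total_weight


# Hypothetical scores: a horror movie watched in the morning looks unreliable.
scores = {("u1", "horror_night"): 0.9, ("u1", "horror_morning"): 0.2}
soft = reweight_edges(scores)
hard = reweight_edges(scores, hard=True)
loss = weighted_loss({("u1", "horror_night"): 1.0,
                      ("u1", "horror_morning"): 3.0}, soft)
```

With soft weights the suspicious interaction still contributes to the loss, only less; with hard weights it is removed from the graph entirely.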
☆ LEAP:D - A Novel Prompt-based Approach for Domain-Generalized Aerial Object Detection ICIP 2024
Drone-captured images present significant challenges in object detection due
to varying shooting conditions, which can alter object appearance and shape.
Factors such as drone altitude, angle, and weather cause these variations,
influencing the performance of object detection algorithms. To tackle these
challenges, we introduce an innovative vision-language approach using learnable
prompts. This shift from conventional manual prompts aims to reduce
domain-specific knowledge interference, ultimately improving object detection
capabilities. Furthermore, we streamline the training process with a one-step
approach, updating the learnable prompt concurrently with model training,
enhancing efficiency without compromising performance. Our study contributes to
domain-generalized object detection by leveraging learnable prompts and
optimizing training processes. This enhances model robustness and adaptability
across diverse environments, leading to more effective aerial object detection.
comment: ICIP 2024 Workshop accepted paper
☆ Gazing at Rewards: Eye Movements as a Lens into Human and AI Decision-Making in Hybrid Visual Foraging
Imagine searching a collection of coins for quarters ($0.25$), dimes
($0.10$), nickels ($0.05$), and pennies ($0.01$): a hybrid foraging task where
observers look for multiple instances of multiple target types. In such tasks,
how do target values and their prevalence influence foraging and eye movement
behaviors (e.g., should you prioritize rare quarters or common nickels)? To
explore this, we conducted human psychophysics experiments, revealing that
humans are proficient reward foragers. Their eye fixations are drawn to regions
with higher average rewards, fixation durations are longer on more valuable
targets, and their cumulative rewards exceed chance, approaching the upper
bound of optimal foragers. To probe these decision-making processes of humans,
we developed a transformer-based Visual Forager (VF) model trained via
reinforcement learning. Our VF model takes a series of targets, their
corresponding values, and the search image as inputs, processes the images
using foveated vision, and produces a sequence of eye movements along with
decisions on whether to collect each fixated item. Our model outperforms all
baselines, achieves cumulative rewards comparable to those of humans, and
approximates human foraging behavior in eye movements and foraging biases
within time-limited environments. Furthermore, stress tests on
out-of-distribution tasks with novel targets, unseen values, and varying set
sizes demonstrate the VF model's effective generalization. Our work offers
valuable insights into the relationship between eye movements and
decision-making, with our model serving as a powerful tool for further
exploration of this connection. All data, code, and models will be made
publicly available.
☆ Advancing Diffusion Models: Alias-Free Resampling and Enhanced Rotational Equivariance
Recent advances in image generation, particularly via diffusion models, have
led to impressive improvements in image synthesis quality. Despite this,
diffusion models are still challenged by model-induced artifacts and limited
stability in image fidelity. In this work, we hypothesize that the primary
cause of this issue is the improper resampling operation that introduces
aliasing in the diffusion model and a careful alias-free resampling dictated by
image processing theory can improve the model's performance in image synthesis.
We propose the integration of alias-free resampling layers into the UNet
architecture of diffusion models without adding extra trainable parameters,
thereby maintaining computational efficiency. We then assess whether these
theory-driven modifications enhance image quality and rotational equivariance.
Our experimental results on benchmark datasets, including CIFAR-10, MNIST, and
MNIST-M, reveal consistent gains in image quality, particularly in terms of FID
and KID scores. Furthermore, we propose a modified diffusion process that
enables user-controlled rotation of generated images without requiring
additional training. Our findings highlight the potential of theory-driven
enhancements such as alias-free resampling in generative models to improve
image quality while maintaining model efficiency and pioneer future research
directions to incorporate them into video-generating diffusion models, enabling
deeper exploration of the applications of alias-free resampling in generative
modeling.
comment: 13 pages, 7 figures
☆ Towards Scalable Handwriting Communication via EEG Decoding and Latent Embedding Integration
In recent years, brain-computer interfaces have made advances in decoding
various motor-related tasks, including gesture recognition and movement
classification, utilizing electroencephalogram (EEG) data. These developments
are fundamental in exploring how neural signals can be interpreted to recognize
specific physical actions. This study centers on a written alphabet
classification task, where we aim to decode EEG signals associated with
handwriting. To achieve this, we incorporate hand kinematics to guide the
extraction of the consistent embeddings from high-dimensional neural recordings
using auxiliary variables (CEBRA). These CEBRA embeddings, along with the EEG,
are processed by a parallel convolutional neural network model that extracts
features from both data sources simultaneously. The model classifies nine
different handwritten characters, including symbols such as exclamation marks
and commas, within the alphabet. We evaluate the model using a quantitative
five-fold cross-validation approach and explore the structure of the embedding
space through visualizations. Our approach achieves a classification accuracy
of 91% for the nine-class task, demonstrating the feasibility of fine-grained
handwriting decoding from EEG.
comment: 4 pages, 2 figures, 1 table, Name of Conference: International
Conference on Brain-Computer Interface
☆ Artificial Theory of Mind and Self-Guided Social Organisation
One of the challenges artificial intelligence (AI) faces is how a collection
of agents coordinate their behaviour to achieve goals that are not reachable by
any single agent. In a recent article by Ozmen et al., this was framed as one of
six grand challenges: that AI needs to respect human cognitive processes at the
human-AI interaction frontier. We suggest that this extends to the AI-AI
frontier and that it should also reflect human psychology, as it is the only
successful framework we have from which to build out. In this extended abstract
we first make the case for collective intelligence in a general setting,
drawing on recent work from single neuron complexity in neural networks and ant
network adaptability in ant colonies. From there we introduce how species
relate to one another in an ecological network via niche selection, niche
choice, and niche conformity with the aim of forming an analogy with human
social network development as new agents join together and coordinate. From
there we show how our social structures are influenced by our neuro-physiology,
our psychology, and our language. This emphasises how individual people within
a social network influence the structure and performance of that network in
complex tasks, and that cognitive faculties such as Theory of Mind play a
central role. We finish by discussing the current state of the art in AI and
where there is potential for further development of a socially embodied
collective artificial intelligence that is capable of guiding its own social
structures.
comment: 4 pages
☆ Theory of Mind Enhances Collective Intelligence
Collective Intelligence plays a central role in a large variety of fields,
from economics and evolutionary theory to neural networks and eusocial insects,
and it is also core to much of the work on emergence and self-organisation in
complex systems theory. However, in human collective intelligence there is
still much more to be understood in the relationship between specific
psychological processes at the individual level and the emergence of
self-organised structures at the social level. Previously psychological factors
have played a relatively minor role in the study of collective intelligence as
the principles are often quite general and applicable to humans just as readily
as insects or other agents without sophisticated psychologies. In this article
we emphasise, with examples from other complex adaptive systems, the broad
applicability of collective intelligence principles while the mechanisms and
time-scales differ significantly between examples. We contend that flexible
collective intelligence in human social settings is improved by our use of a
specific cognitive tool: our Theory of Mind. We identify several key
characteristics of psychologically mediated collective intelligence and show
that the development of a Theory of Mind is a crucial factor distinguishing
social collective intelligence from general collective intelligence. We then
place these capabilities in the context of the next steps in artificial
intelligence embedded in a future that includes an effective human-AI hybrid
social ecology.
comment: 20 pages, 2 figures, 1 table
☆ Rationality based Innate-Values-driven Reinforcement Learning
Innate values describe agents' intrinsic motivations, which reflect their
inherent interests and preferences to pursue goals and drive them to develop
diverse skills satisfying their various needs. The essence of reinforcement
learning (RL) is learning from interaction based on reward-driven behaviors,
much like natural agents. It is an excellent model to describe the
innate-values-driven (IV) behaviors of AI agents. In particular, developing an
AI agent's awareness by balancing internal and external utilities based on its
needs in different tasks is a crucial problem for individuals learning to
support AI agents that integrate into human society safely and harmoniously
in the long term. This paper proposes a hierarchical compound intrinsic value
reinforcement learning model -- innate-values-driven reinforcement learning
termed IVRL to describe the complex behaviors of AI agents' interaction. We
formulated the IVRL model and proposed two IVRL models: DQN and A2C. By
comparing them with benchmark algorithms such as DQN, DDQN, A2C, and PPO in the
Role-Playing Game (RPG) reinforcement learning test platform ViZDoom, we
demonstrated that rationally organizing various individual needs can
effectively achieve better performance.
comment: arXiv admin note: substantial text overlap with arXiv:2401.05572
☆ The \emph{Optimist}: Towards Fully Automated Graph Theory Research
This paper introduces the \emph{Optimist}, an autonomous system developed to
advance automated conjecture generation in graph theory. Leveraging
mixed-integer programming (MIP) and heuristic methods, the \emph{Optimist}
generates conjectures that both rediscover established theorems and propose
novel inequalities. Through a combination of memory-based computation and
agent-like adaptability, the \emph{Optimist} iteratively refines its
conjectures by integrating new data, enabling a feedback process with minimal
human (\emph{or machine}) intervention. Initial experiments reveal the
\emph{Optimist}'s potential to uncover foundational results in graph theory, as
well as to produce conjectures of interest for future exploration. This work
also outlines the \emph{Optimist}'s evolving integration with a counterpart
agent, the \emph{Pessimist} (a human \emph{or machine} agent), to establish a
dueling system that will drive fully automated graph theory research.
☆ ABCI 3.0: Evolution of the leading AI infrastructure in Japan
ABCI 3.0 is the latest version of ABCI, a large-scale open AI infrastructure
that AIST has been operating since August 2018; the new version will be fully
operational in January 2025. ABCI 3.0 consists of computing servers equipped
with 6128 NVIDIA H200 GPUs and an all-flash storage system. Its peak
performance is 6.22 exaflops in half precision and 3.0 exaflops in single
precision, which is 7 to 13 times faster than the previous system, ABCI 2.0. It
also more than doubles both storage capacity and theoretical read/write
performance. ABCI 3.0 is expected to accelerate research and development,
evaluation, and workforce development of cutting-edge AI technologies, with a
particular focus on generative AI.
comment: 4 pages, 2 figures
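As a quick sanity check on the figures quoted above, dividing the stated peak performance by the stated GPU count gives the implied per-GPU peak. This sketch assumes the whole quoted peak is attributable to the GPUs, which the abstract does not state; the variable and function names are illustrative.

```python
# Figures taken from the abstract.
GPU_COUNT = 6128
PEAK_HALF_FLOPS = 6.22e18    # 6.22 exaflops, half precision
PEAK_SINGLE_FLOPS = 3.0e18   # 3.0 exaflops, single precision


def per_gpu_petaflops(total_flops: float, gpus: int) -> float:
    """Implied peak per GPU in petaflops (1 petaflops = 1e15 FLOPS)."""
    return total_flops / gpus / 1e15


half_per_gpu = per_gpu_petaflops(PEAK_HALF_FLOPS, GPU_COUNT)
single_per_gpu = per_gpu_petaflops(PEAK_SINGLE_FLOPS, GPU_COUNT)

print(f"half precision:   {half_per_gpu:.3f} PFLOPS per GPU")
print(f"single precision: {single_per_gpu:.3f} PFLOPS per GPU")
```

Under that assumption the system-level numbers work out to roughly 1.0 petaflops per GPU in half precision and roughly 0.49 in single precision.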
☆ DROJ: A Prompt-Driven Attack against Large Language Models
Large Language Models (LLMs) have demonstrated exceptional capabilities
across various natural language processing tasks. Due to their training on
internet-sourced datasets, LLMs can sometimes generate objectionable content,
necessitating extensive alignment with human feedback to avoid such outputs.
Despite massive alignment efforts, LLMs remain susceptible to adversarial
jailbreak attacks, which usually are manipulated prompts designed to circumvent
safety mechanisms and elicit harmful responses. Here, we introduce a novel
approach, Directed Representation Optimization Jailbreak (DROJ), which
optimizes jailbreak prompts at the embedding level to shift the hidden
representations of harmful queries towards directions that are more likely to
elicit affirmative responses from the model. Our evaluations on the
LLaMA-2-7b-chat model show that DROJ achieves a 100\% keyword-based Attack
Success Rate (ASR),
effectively preventing direct refusals. However, the model occasionally
produces repetitive and non-informative responses. To mitigate this, we
introduce a helpfulness system prompt that enhances the utility of the model's
responses. Our code is available at
https://github.com/Leon-Leyang/LLM-Safeguard.
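A keyword-based success rate of the kind mentioned above is usually computed by checking each response for refusal phrases and counting the responses that contain none. The sketch below is a generic illustration of that metric, not the authors' evaluation code; the keyword list and function names are assumptions.

```python
from typing import Sequence

# Hypothetical refusal phrases; real evaluations use longer curated lists.
REFUSAL_KEYWORDS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_refusal(response: str) -> bool:
    """True if the response contains any refusal phrase (case-insensitive)."""
    text = response.lower()
    return any(keyword in text for keyword in REFUSAL_KEYWORDS)


def keyword_success_rate(responses: Sequence[str]) -> float:
    """Fraction of responses that contain no refusal phrase."""
    if not responses:
        return 0.0
    return sum(not is_refusal(r) for r in responses) / len(responses)


example_responses = [
    "I'm sorry, but I can't help with that.",
    "Sure sure sure sure sure",           # not a refusal, but not informative
    "Here is a general overview of the topic.",
]
rate = keyword_success_rate(example_responses)
```

The second example shows why the metric is coarse: a repetitive, non-informative response still counts as a success, which matches the limitation the abstract reports.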
☆ VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition
Recent advancements in Large Video-Language Models (LVLMs) have driven the
development of benchmarks designed to assess cognitive abilities in video-based
tasks. However, most existing benchmarks heavily rely on web-collected videos
paired with human annotations or model-generated questions, which limit control
over the video content and fall short in evaluating advanced cognitive
abilities involving symbolic elements and abstract concepts. To address these
limitations, we introduce VCBench, a controllable benchmark to assess LVLMs'
cognitive abilities, involving symbolic and abstract concepts at varying
difficulty levels. By generating video data with a Python-based engine,
VCBench allows for precise control over the video content, creating dynamic,
task-oriented videos that feature complex scenes and abstract concepts. Each
task pairs with tailored question templates that target specific cognitive
challenges, providing a rigorous evaluation test. Our evaluation reveals that
even state-of-the-art (SOTA) models, such as Qwen2-VL-72B, struggle with simple
video cognition tasks involving abstract concepts, with performance sharply
dropping by 19% as video complexity rises. These findings reveal the current
limitations of LVLMs in advanced cognitive tasks and highlight the critical
role of VCBench in driving research toward more robust LVLMs for complex video
cognition challenges.
☆ Provocation: Who benefits from "inclusion" in Generative AI? NeurIPS 2024
The demand for accurate and representative generative AI systems means there
is an increased demand for participatory evaluation structures. While these
participatory structures are paramount to ensure non-dominant values,
knowledge and material culture are also reflected in AI models and the media
they generate, we argue that dominant structures of community participation in
AI development and evaluation are not explicit enough about the benefits and
harms that members of socially marginalized groups may experience as a result
of their participation. Without explicit interrogation of these benefits by AI
developers, as a community we may remain blind to the immensity of systemic
change that is needed as well. To support this provocation, we present a
speculative case study, developed from our own collective experiences as AI
researchers. We use this speculative context to itemize the barriers that need
to be overcome in order for the proposed benefits to marginalized communities
to be realized, and harms mitigated.
comment: 3 pages, 1 figure. Published as a Short Paper in the NeurIPS 2024
Workshop on Evaluating Evaluations: Examining Best Practices for Measuring
Broader Impacts of Generative AI
☆ Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery
Vision Transformers (ViT) have recently brought a new wave of research in the
field of computer vision. These models have done particularly well in the field
of image classification and segmentation. Research on semantic and instance
segmentation has accelerated since the inception of the new
architecture, with over 80\% of the top 20 benchmarks for the iSAID dataset
being either based on the ViT architecture or the attention mechanism behind
its success. This paper focuses on the heuristic comparison of three key
factors of using (or not using) ViT for semantic segmentation of remote sensing
aerial images from the iSAID dataset. The experimental results were examined
against the following objectives: 1.
Use of weighted fused loss function for the maximum mean Intersection over
Union (mIoU) score, Dice score, and minimization or conservation of entropy or
class representation, 2. Comparison of transfer learning on Meta's MaskFormer,
a ViT-based semantic segmentation model, against generic UNet Convolutional
Neural Networks (CNNs) judged over mIoU, Dice scores, training efficiency, and
inference time, and 3. What do we lose for what we gain? i.e., the comparison
of the two models against current state-of-the-art segmentation models. We show that the
use of the novel combined weighted loss function significantly boosts the CNN
model's performance capacities as compared to transfer learning the ViT. The
code for this implementation can be found on
\url{https://github.com/ashimdahal/ViT-vs-CNN-ImageSegmentation}.
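The two metrics named in this abstract, mean Intersection over Union (mIoU) and the Dice score, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `miou_and_dice` and the choice to skip classes absent from both maps are assumptions made for illustration.

```python
import numpy as np

def miou_and_dice(pred, target, num_classes):
    """Return (mean IoU, mean Dice) over classes present in pred or target.

    pred and target are integer label maps of the same shape.
    """
    ious, dices = [], []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: skip it (an assumption)
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union)
        dices.append(2 * inter / (p.sum() + t.sum()))
    return float(np.mean(ious)), float(np.mean(dices))

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
miou, dice = miou_and_dice(pred, target, num_classes=3)
```

Per class, Dice equals 2·IoU / (1 + IoU), so the two scores rank a single class identically but can diverge once averaged over classes.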
☆ NeuralDEM -- Real-time Simulation of Industrial Particulate Flows
Benedikt Alkin, Tobias Kronlachner, Samuele Papa, Stefan Pirker, Thomas Lichtenegger, Johannes Brandstetter
Advancements in computing power have made it possible to numerically simulate
large-scale fluid-mechanical and/or particulate systems, many of which are
integral to core industrial processes. Among the different numerical methods
available, the discrete element method (DEM) provides one of the most accurate
representations of a wide range of physical systems involving granular and
discontinuous materials. Consequently, DEM has become a widely accepted
approach for tackling engineering problems connected to granular flows and
powder mechanics. Additionally, DEM can be integrated with grid-based
computational fluid dynamics (CFD) methods, enabling the simulation of chemical
processes taking place, e.g., in fluidized beds. However, DEM is
computationally intensive because of the intrinsic multiscale nature of
particulate systems, restricting simulation duration or number of particles.
Towards this end, NeuralDEM presents an end-to-end approach to replace slow
numerical DEM routines with fast, adaptable deep learning surrogates. NeuralDEM
is capable of picturing long-term transport processes across different regimes
using macroscopic observables without any reference to microscopic model
parameters. First, NeuralDEM treats the Lagrangian discretization of DEM as an
underlying continuous field, while simultaneously modeling macroscopic behavior
directly as additional auxiliary fields. Second, NeuralDEM introduces
multi-branch neural operators scalable to real-time modeling of
industrially-sized scenarios - from slow and pseudo-steady to fast and
transient. Such scenarios have previously posed insurmountable challenges for
deep learning models. Notably, NeuralDEM faithfully models coupled CFD-DEM
fluidized bed reactors of 160k CFD cells and 500k DEM particles for
trajectories of 28s. NeuralDEM will open many new doors to advanced engineering
and much faster process cycles.
comment: Project page: https://nx-ai.github.io/NeuralDEM/
☆ Adopting RAG for LLM-Aided Future Vehicle Design
In this paper, we explore the integration of Large Language Models (LLMs)
with Retrieval-Augmented Generation (RAG) to enhance automated design and
software development in the automotive industry. We present two case studies: a
standardization compliance chatbot and a design copilot, both utilizing RAG to
provide accurate, context-aware responses. We evaluate four LLMs -- GPT-4o,
LLAMA3, Mistral, and Mixtral -- comparing their answering accuracy and
execution time. Our results demonstrate that while GPT-4 offers superior
performance, LLAMA3 and Mistral also show promising capabilities for local
deployment, addressing data privacy concerns in automotive applications. This
study highlights the potential of RAG-augmented LLMs in improving design
workflows and compliance in automotive engineering.
comment: Conference paper accepted in IEEE FLLM 2024
☆ LEAP:D -- A Novel Prompt-based Approach for Domain-Generalized Aerial Object Detection ICIP 2024
Drone-captured images present significant challenges in object detection due
to varying shooting conditions, which can alter object appearance and shape.
Factors such as drone altitude, angle, and weather cause these variations,
influencing the performance of object detection algorithms. To tackle these
challenges, we introduce an innovative vision-language approach using learnable
prompts. This shift from conventional manual prompts aims to reduce
domain-specific knowledge interference, ultimately improving object detection
capabilities. Furthermore, we streamline the training process with a one-step
approach, updating the learnable prompt concurrently with model training,
enhancing efficiency without compromising performance. Our study contributes to
domain-generalized object detection by leveraging learnable prompts and
optimizing training processes. This enhances model robustness and adaptability
across diverse environments, leading to more effective aerial object detection.
comment: ICIP 2024 Workshop accepted paper
♻ ☆ Enhancing Maritime Trajectory Forecasting via H3 Index and Causal Language Modelling (CLM)
The prediction of ship trajectories is a growing field of study in artificial
intelligence. Traditional methods rely on the use of LSTM, GRU networks, and
even Transformer architectures for the prediction of spatio-temporal series.
This study proposes a viable alternative for predicting these trajectories
using only GNSS positions. It considers this spatio-temporal problem as a
natural language processing problem. The latitude/longitude coordinates of AIS
messages are transformed into cell identifiers using the H3 index. Thanks to
the pseudo-octal representation, it becomes easier for language models to learn
the spatial hierarchy of the H3 index. The method is compared with a classical
Kalman filter, widely used in the maritime domain, and introduces the Fréchet
distance as the main evaluation metric. We show that it is possible to predict
ship trajectories quite precisely up to 8 hours ahead with 30 minutes of
context, using solely GNSS positions, without relying on any additional
information such as speed, course, or external conditions - unlike many
traditional methods. We demonstrate that this alternative works well enough to
predict trajectories worldwide.
comment: 28 pages, 18 figures
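The abstract names the Fréchet distance as its main evaluation metric. A common computable variant is the discrete Fréchet distance between two polylines, sketched below. Both the discrete variant and the plain Euclidean point distance are assumptions here; for latitude/longitude tracks a geodesic distance would be the more natural choice.

```python
import math

def discrete_frechet(p, q):
    """Discrete Frechet distance between point sequences p and q.

    ca[i][j] holds the smallest achievable "leash length" for the
    prefixes p[:i+1] and q[:j+1] (standard dynamic program).
    """
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])  # Euclidean distance (assumption)
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]

predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
observed = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
dist = discrete_frechet(predicted, observed)
```

Unlike a pointwise average error, this metric respects the ordering of points along each trajectory, which is why it suits path comparison.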
♻ ☆ Quantitative Assessment of Intersectional Empathetic Bias and Understanding
A growing body of literature critiques the current operationalizations of
empathy based on loose definitions of the construct. Such definitions
negatively affect dataset quality, model robustness, and evaluation
reliability. We propose an empathy evaluation framework that operationalizes
empathy close to its psychological origins. The framework measures the variance
in responses of LLMs to prompts using existing metrics for empathy and
emotional valence. The variance is introduced through the controlled generation
of the prompts by varying social biases affecting context understanding, thus
impacting empathetic understanding. The control over generation ensures high
theoretical validity of the constructs in the prompt dataset. Also, it makes
high-quality translation, especially into languages that currently have
little-to-no way of evaluating empathy or bias, such as the Slavonic family,
more manageable. Using chosen LLMs and various prompt types, we demonstrate the
empathy evaluation with the framework, including multiple-choice answers and
free generation. The variance in our initial evaluation sample is small and we
were unable to measure convincing differences between the empathetic
understanding in contexts given by different social groups. However, the
results are promising because the models showed significant alterations in their
reasoning chains needed to capture the relatively subtle changes in the
prompts. This provides the basis for future research into the construction of
the evaluation sample and statistical methods for measuring the results.
♻ ☆ Lifted Inference beyond First-Order Logic
Weighted First Order Model Counting (WFOMC) is fundamental to probabilistic
inference in statistical relational learning models. As WFOMC is known to be
intractable in general ($\#$P-complete), logical fragments that admit
polynomial time WFOMC are of significant interest. Such fragments are called
domain liftable. Recent works have shown that the two-variable fragment of
first order logic extended with counting quantifiers ($\mathrm{C^2}$) is
domain-liftable. However, many properties of real-world data, like acyclicity
in citation networks and connectivity in social networks, cannot be modeled in
$\mathrm{C^2}$, or first order logic in general. In this work, we expand the
domain liftability of $\mathrm{C^2}$ with multiple such properties. We show
that any $\mathrm{C^2}$ sentence remains domain liftable when one of its
relations is restricted to represent a directed acyclic graph, a connected
graph, a tree (resp. a directed tree) or a forest (resp. a directed forest).
All our results rely on a novel and general methodology of "counting by
splitting". Besides their application to probabilistic inference, our results
provide a general framework for counting combinatorial structures. We expand a
vast array of previous results in discrete mathematics literature on directed
acyclic graphs, phylogenetic networks, etc.
comment: Under Review at the Artificial Intelligence Journal. Added two new
lemmas for counting by splitting in the Main approach section. Added
experiments with Markov Logic. arXiv admin note: text overlap with
arXiv:2302.09830
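As a small illustration of the kind of combinatorial counting mentioned in this abstract, the sketch below counts labeled directed acyclic graphs with the classical inclusion-exclusion recurrence over the set of source nodes. It is not the authors' "counting by splitting" method, only a well-known baseline for the same count.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_labeled_dags(n):
    """Number of labeled DAGs on n nodes.

    Recurrence: a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k),
    where k ranges over the size of a chosen set of nodes with no incoming edges.
    """
    if n == 0:
        return 1
    total = 0
    for k in range(1, n + 1):
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(n, k) * 2 ** (k * (n - k)) * count_labeled_dags(n - k)
    return total

counts = [count_labeled_dags(n) for n in range(5)]
```

The recurrence runs in time polynomial in n, which is the same flavor of guarantee that domain liftability asks for in the weighted, logical setting.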
♻ ☆ Learning Multi-Agent Loco-Manipulation for Long-Horizon Quadrupedal Pushing
Yuming Feng, Chuye Hong, Yaru Niu, Shiqi Liu, Yuxiang Yang, Wenhao Yu, Tingnan Zhang, Jie Tan, Ding Zhao
Recently, quadrupedal locomotion has achieved significant success, but the
manipulation capabilities of quadrupedal robots, particularly in handling large objects, remain
limited, restricting their usefulness in demanding real-world applications such
as search and rescue, construction, industrial automation, and room
organization. This paper tackles the task of obstacle-aware, long-horizon
pushing by multiple quadrupedal robots. We propose a hierarchical multi-agent
reinforcement learning framework with three levels of control. The high-level
controller integrates an RRT planner and a centralized adaptive policy to
generate subgoals, while the mid-level controller uses a decentralized
goal-conditioned policy to guide the robots toward these subgoals. A
pre-trained low-level locomotion policy executes the movement commands. We
evaluate our method against several baselines in simulation, demonstrating
significant improvements over baseline approaches, with 36.0% higher success
rates and 24.5% reduction in completion time than the best baseline. Our
framework successfully enables long-horizon, obstacle-aware manipulation tasks
like Push-Cuboid and Push-T on Go1 robots in the real world.
♻ ☆ Equivariant Symmetry Breaking Sets
Equivariant neural networks (ENNs) have been shown to be extremely effective
in applications involving underlying symmetries. By construction ENNs cannot
produce lower symmetry outputs given a higher symmetry input. However, symmetry
breaking occurs in many physical systems and we may obtain a less symmetric
stable state from an initial highly symmetric one. Hence, it is imperative that
we understand how to systematically break symmetry in ENNs. In this work, we
propose a novel symmetry breaking framework that is fully equivariant and is
the first which fully addresses spontaneous symmetry breaking. We emphasize
that our approach is general and applicable to equivariance under any group. To
achieve this, we introduce the idea of symmetry breaking sets (SBS). Rather
than redesign existing networks, we design sets of symmetry breaking objects
which we feed into our network based on the symmetry of our inputs and outputs.
We show there is a natural way to define equivariance on these sets, which
gives an additional constraint. Minimizing the size of these sets equates to
data efficiency. We prove that minimizing these sets translates to a well
studied group theory problem, and tabulate solutions to this problem for the
point groups. Finally, we provide some examples of symmetry breaking to
demonstrate how our approach works in practice. The code for these examples is
available at \url{https://github.com/atomicarchitects/equivariant-SBS}.
comment: 50 pages, 19 figures. Published in Transactions on Machine Learning
Research, October 2024
♻ ☆ FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, Mark Wildon
We introduce FrontierMath, a benchmark of hundreds of original, exceptionally
challenging mathematics problems crafted and vetted by expert mathematicians.
The questions cover most major branches of modern mathematics -- from
computationally intensive problems in number theory and real analysis to
abstract questions in algebraic geometry and category theory. Solving a typical
problem requires multiple hours of effort from a researcher in the relevant
branch of mathematics, and for the upper end questions, multiple days.
FrontierMath uses new, unpublished problems and automated verification to
reliably evaluate models while minimizing risk of data contamination. Current
state-of-the-art AI models solve under 2% of problems, revealing a vast gap
between AI capabilities and the prowess of the mathematical community. As AI
systems advance toward expert-level mathematical abilities, FrontierMath offers
a rigorous testbed that quantifies their progress.
♻ ☆ Is Linear Feedback on Smoothed Dynamics Sufficient for Stabilizing Contact-Rich Plans? ICRA2025
Yuki Shirai, Tong Zhao, H. J. Terry Suh, Huaijiang Zhu, Xinpei Ni, Jiuguang Wang, Max Simchowitz, Tao Pang
Designing planners and controllers for contact-rich manipulation is extremely
challenging as contact violates the smoothness conditions that many
gradient-based controller synthesis tools assume. Contact smoothing
approximates a non-smooth system with a smooth one, allowing one to use these
synthesis tools more effectively. However, applying classical control synthesis
methods to smoothed contact dynamics remains relatively under-explored. This
paper analyzes the efficacy of linear controller synthesis using differential
simulators based on contact smoothing. We introduce natural baselines for
leveraging contact smoothing to compute (a) open-loop plans robust to uncertain
conditions and/or dynamics, and (b) feedback gains to stabilize around
open-loop plans. Using robotic bimanual whole-body manipulation as a testbed,
we perform extensive empirical experiments on over 300 trajectories and analyze
why LQR seems insufficient for stabilizing contact-rich plans. The video
summarizing this paper and hardware experiments is found here:
https://youtu.be/HLaKi6qbwQg?si=_zCAmBBD6rGSitm9.
comment: Under review for ICRA2025
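The feedback gains in item (b) of this abstract can be illustrated with a finite-horizon, time-varying LQR computed by a backward Riccati recursion. This is a minimal sketch, not the paper's code: it assumes the linearizations `A_list` and `B_list` along the open-loop plan are already available (in the paper's setting they would come from a smoothed contact model), and the function name and cost matrices are hypothetical.

```python
import numpy as np

def tv_lqr_gains(A_list, B_list, Q, R, Q_final):
    """Backward Riccati recursion for x[t+1] = A[t] x[t] + B[t] u[t].

    Returns gains K[t] for the feedback law
    u[t] = u_plan[t] - K[t] @ (x[t] - x_plan[t]).
    """
    P = Q_final
    gains = []
    for A, B in zip(reversed(A_list), reversed(B_list)):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)  # Riccati update using the new gain
        gains.append(K)
    gains.reverse()
    return gains

# Toy scalar integrator with unit costs over a 50-step horizon.
one = np.array([[1.0]])
gains = tv_lqr_gains([one] * 50, [one] * 50, Q=one, R=one, Q_final=one)
```

For the toy system the early gains approach the steady-state value, while the final gain reflects only the terminal cost.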
♻ ☆ Knowledge Bases in Support of Large Language Models for Processing Web News
Large Language Models (LLMs) have recently received considerable interest across a
wide range of applications. During pre-training on massive datasets, such a model
implicitly memorizes the factual knowledge of its training data in its hidden
parameters. However, knowledge held implicitly in parameters often makes its
use by downstream applications ineffective due to the lack of common-sense
reasoning. In this article, we introduce a general framework for building
knowledge bases with the aid of LLMs, tailored for processing Web news.
The framework applies a rule-based News Information Extractor (NewsIE) to news
items for extracting their relational tuples, referred to as knowledge bases,
which are then graph-convoluted with the implicit knowledge facts of news items
obtained by LLMs, for their classification. It involves two lightweight
components: 1) NewsIE: for extracting the structural information of every news
item, in the form of relational tuples; 2) BERTGraph: for graph convoluting the
implicit knowledge facts with relational tuples extracted by NewsIE. We have
evaluated our framework under different news-related datasets for news category
classification, with promising experimental results.
comment: 10 pages, 5 figures
♻ ☆ Affordance-based Robot Manipulation with Flow Matching
We present a framework for assistive robot manipulation, which focuses on two
fundamental challenges: first, efficiently adapting large-scale models to
downstream scene affordance understanding tasks, especially in daily living
scenarios where gathering multi-task data involving humans requires strenuous
effort; second, effectively learning robot trajectories by grounding the visual
affordance model. We tackle the first challenge by employing a
parameter-efficient prompt tuning method that prepends learnable text prompts
to the frozen vision model to predict manipulation affordances in multi-task
scenarios. Then we propose to learn robot trajectories guided by affordances in
a supervised Flow Matching method. Flow matching represents a robot visuomotor
policy as a conditional process of flowing random waypoints to desired robot
trajectories. Finally, we introduce a real-world dataset with 10 tasks across
Activities of Daily Living to test our framework. Our extensive evaluation
highlights that the proposed prompt tuning method for learning manipulation
affordance with language prompter achieves competitive performance and even
outperforms other finetuning protocols across data scales, while satisfying
parameter efficiency. Learning multi-task robot trajectories with flow matching
policy also leads to consistently better generalization performance and faster
inference than alternative behavior cloning methods, especially given
multimodal robot action distributions. Our framework seamlessly unifies
affordance model learning and trajectory generation with flow matching for
robot manipulation.
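The description of flow matching as "flowing random waypoints to desired robot trajectories" can be sketched as follows. The straight-line interpolation path and the Euler sampler are assumptions (the abstract does not specify the path or the solver), and the conditioning on affordances is omitted. A learned network would replace the oracle velocity used in the usage lines.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Training pair on the straight path from noise x0 to data x1.

    Returns the interpolated point x_t and the velocity a model
    would be regressed onto at (x_t, t).
    """
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
noise = rng.normal(size=(4, 2))    # random waypoints
demo = np.ones((4, 2))             # stand-in for a demonstrated trajectory
x_mid, v = flow_matching_pair(noise, demo, t=0.5)
# Oracle velocity field for this single pair; a trained model goes here.
sample = euler_sample(lambda x, t: demo - noise, noise)
```

Because the target velocity is constant along a straight path, a few Euler steps suffice in this toy case, which is consistent with the fast inference the abstract reports.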
♻ ☆ Can LLMs Recognize Toxicity? A Structured Investigation Framework and Toxicity Metric
In the pursuit of developing Large Language Models (LLMs) that adhere to
societal standards, it is imperative to detect the toxicity in the generated
text. The majority of existing toxicity metrics rely on encoder models trained
on specific toxicity datasets, which are susceptible to out-of-distribution
(OOD) problems and depend on the dataset's definition of toxicity. In this
paper, we introduce a robust metric grounded on LLMs to flexibly measure
toxicity according to the given definition. We first analyze the toxicity
factors, followed by an examination of the intrinsic toxic attributes of LLMs
to ascertain their suitability as evaluators. Finally, we evaluate the
performance of our metric with detailed analysis. Our empirical results
demonstrate outstanding performance in measuring toxicity within verified
factors, improving on conventional metrics by 12 points in the F1 score. Our
findings also indicate that upstream toxicity significantly influences
downstream metrics, suggesting that LLMs are unsuitable for toxicity
evaluations within unverified factors.
comment: 8 pages
♻ ☆ A Similarity-Based Oversampling Method for Multi-label Imbalanced Text Data
In real-world applications, as data availability increases, obtaining labeled
data for machine learning (ML) projects remains challenging due to the high
costs and intensive efforts required for data annotation. Many ML projects,
particularly those focused on multi-label classification, also grapple with
data imbalance issues, where certain classes may lack sufficient data to train
effective classifiers. This study introduces and examines a novel oversampling
method for multi-label text classification, designed to address performance
challenges associated with data imbalance. The proposed method identifies
potential new samples from unlabeled data by leveraging similarity measures
between instances. By iteratively searching the unlabeled dataset, the method
locates instances similar to those in underrepresented classes and evaluates
their contribution to classifier performance enhancement. Instances that
demonstrate performance improvement are then added to the labeled dataset.
Experimental results indicate that the proposed approach effectively enhances
classifier performance post-oversampling.
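The candidate-selection step of this abstract can be sketched as a similarity search from underrepresented-class instances into the unlabeled pool. Cosine similarity, the fixed threshold, and the function name are assumptions, since the abstract does not name the similarity measure. The subsequent step, which keeps only candidates that improve classifier performance, is not shown.

```python
import numpy as np

def select_similar_unlabeled(minority_vecs, unlabeled_vecs, threshold=0.8):
    """Indices of unlabeled rows whose best cosine similarity to any
    minority-class row is at least `threshold`."""
    def normalize(m):
        m = np.asarray(m, dtype=float)
        norms = np.linalg.norm(m, axis=1, keepdims=True)
        return m / np.clip(norms, 1e-12, None)  # guard against zero vectors

    sims = normalize(unlabeled_vecs) @ normalize(minority_vecs).T
    best = sims.max(axis=1)
    return np.flatnonzero(best >= threshold)

minority = [[1.0, 0.0]]
unlabeled = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
strict = select_similar_unlabeled(minority, unlabeled, threshold=0.8)
loose = select_similar_unlabeled(minority, unlabeled, threshold=0.7)
```

In practice the rows would be text embeddings or TF-IDF vectors, and the threshold controls how aggressively the pool is mined.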
♻ ☆ IGUANe: a 3D generalizable CycleGAN for multicenter harmonization of brain MR images
In MRI studies, the aggregation of imaging data from multiple acquisition
sites enhances sample size but may introduce site-related variabilities that
hinder consistency in subsequent analyses. Deep learning methods for image
translation have emerged as a solution for harmonizing MR images across sites.
In this study, we introduce IGUANe (Image Generation with Unified Adversarial
Networks), an original 3D model that leverages the strengths of domain
translation and straightforward application of style transfer methods for
multicenter brain MR image harmonization. IGUANe extends CycleGAN by
integrating an arbitrary number of domains for training through a many-to-one
architecture. The framework based on domain pairs enables the implementation of
sampling strategies that prevent confusion between site-related and biological
variabilities. During inference, the model can be applied to any image, even
from an unknown acquisition site, making it a universal generator for
harmonization. Trained on a dataset comprising T1-weighted images from 11
different scanners, IGUANe was evaluated on data from unseen sites. The
assessments included the transformation of MR images with traveling subjects,
the preservation of pairwise distances between MR images within domains, the
evolution of volumetric patterns related to age and Alzheimer's disease (AD),
and the performance in age regression and patient classification tasks.
Comparisons with other harmonization and normalization methods suggest that
IGUANe better preserves individual information in MR images and is more
suitable for maintaining and reinforcing variabilities related to age and AD.
Future studies may further assess IGUANe in other multicenter contexts, either
using the same model or retraining it for applications to different image
modalities. IGUANe is available at
https://github.com/RocaVincent/iguane_harmonization.git.
comment: 29 pages, 14 figures
♻ ☆ Optimizing Automatic Summarization of Long Clinical Records Using Dynamic Context Extension: Testing and Evaluation of the NBCE Method
Summarizing patient clinical notes is vital for reducing documentation
burdens, and manual summarization places a heavy load on medical staff. We propose
an automatic method using LLMs. However, long inputs cause LLMs to lose context,
reducing output quality, especially in small models. We used a 7B model,
open-calm-7b, enhanced with Naive Bayes-based Context Extension (NBCE) and a redesigned
decoding mechanism that references one sentence at a time, keeping inputs within
the 2048-token context window. Our improved model achieved near parity with
Google's Gemini (over 175B parameters) on ROUGE-L with 200 samples, indicating
strong performance using fewer resources and enhancing automated EMR summarization
feasibility.
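The Naive Bayes idea behind the NBCE method named in the title can be sketched for one decoding step. If the context chunks S_1..S_n are assumed conditionally independent given the next token T, then log p(T | S_1..S_n) = sum_k log p(T | S_k) - (n - 1) log p(T) + const. The code below applies only this identity. It does not reproduce the paper's redesigned decoding mechanism, and the function names are hypothetical.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def naive_bayes_combine(cond_logprobs, prior_logprobs):
    """Combine per-chunk next-token log-probabilities.

    cond_logprobs: shape (n_chunks, vocab), log p(T | S_k) for each chunk.
    prior_logprobs: shape (vocab,), log p(T) with no context.
    Returns the renormalized log p(T | S_1..S_n) under the
    conditional-independence assumption.
    """
    cond = np.asarray(cond_logprobs, dtype=float)
    n = cond.shape[0]
    combined = cond.sum(axis=0) - (n - 1) * np.asarray(prior_logprobs, dtype=float)
    return log_softmax(combined)

cond = np.log(np.array([[0.5, 0.5], [0.8, 0.2]]))
prior = np.log(np.array([0.5, 0.5]))
probs = np.exp(naive_bayes_combine(cond, prior))
```

Each chunk is scored in its own forward pass, so no single input has to exceed the model's context window.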
♻ ☆ Doob's Lagrangian: A Sample-Efficient Variational Approach to Transition Path Sampling NeurIPS 2024
Yuanqi Du, Michael Plainer, Rob Brekelmans, Chenru Duan, Frank Noé, Carla P. Gomes, Alán Aspuru-Guzik, Kirill Neklyudov
Rare event sampling in dynamical systems is a fundamental problem arising in
the natural sciences, which poses significant computational challenges due to
an exponentially large space of trajectories. For settings where the dynamical
system of interest follows a Brownian motion with known drift, the question of
conditioning the process to reach a given endpoint or desired rare event is
definitively answered by Doob's h-transform. However, the naive estimation of
this transform is infeasible, as it requires simulating sufficiently many
forward trajectories to estimate rare event probabilities. In this work, we
propose a variational formulation of Doob's h-transform as an optimization
problem over trajectories between a given initial point and the desired ending
point. To solve this optimization, we propose a simulation-free training
objective with a model parameterization that imposes the desired boundary
conditions by design. Our approach significantly reduces the search space over
trajectories and avoids expensive trajectory simulation and inefficient
importance sampling estimators which are required in existing methods. We
demonstrate the ability of our method to find feasible transition paths on
real-world molecular simulation and protein folding tasks.
comment: Accepted as Spotlight at Conference on Neural Information Processing
Systems (NeurIPS 2024); Alanine dipeptide results updated after fixing
unphysical parameterization
♻ ☆ ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting
them to embodied decision-making in open-world environments presents
challenges. One critical issue is bridging the gap between discrete entities in
low-level observations and the abstract concepts required for effective
planning. A common solution is building hierarchical agents, where VLMs serve
as high-level reasoners that break down tasks into executable sub-tasks,
typically specified using language. However, language suffers from the
inability to communicate detailed spatial information. We propose
visual-temporal context prompting, a novel communication protocol between VLMs
and policy models. This protocol leverages object segmentation from past
observations to guide policy-environment interactions. Using this approach, we
train ROCKET-1, a low-level policy that predicts actions based on concatenated
visual observations and segmentation masks, supported by real-time object
tracking from SAM-2. Our method unlocks the potential of VLMs, enabling them to
tackle complex tasks that demand spatial reasoning. Experiments in Minecraft
show that our approach enables agents to achieve previously unattainable tasks,
with a 76% absolute improvement in open-world interaction
performance. Codes and demos are now available on the project page:
https://craftjarvis.github.io/ROCKET-1.
♻ ☆ From Explicit Rules to Implicit Reasoning in an Interpretable Violence Monitoring System
Recently, research based on pre-trained models has demonstrated outstanding
performance in violence surveillance tasks. However, most of these models are
black-box systems that face challenges regarding explainability during
training and inference. An important question is how to incorporate
explicit knowledge into these implicit models, thereby designing expert-driven
and interpretable violence surveillance systems. This paper proposes a new
paradigm for weakly supervised violence monitoring (WSVM) called Rule base
Violence Monitoring (RuleVM). The proposed RuleVM uses a dual-branch structure
with different designs for images and text. One of the branches is called the
implicit branch, which uses only visual features for coarse-grained binary
classification. In this branch, image feature extraction is divided into two
channels: one responsible for extracting scene frames and the other focusing on
extracting actions. The other branch is called the explicit branch, which
utilizes language-image alignment to perform fine-grained classification. For
the language channel design in the explicit branch, the proposed RuleVM uses
the state-of-the-art YOLOWorld model to detect objects in video frames, and
association rules are identified through data mining methods as descriptions of
the video. Leveraging the dual-branch architecture, RuleVM achieves
interpretable coarse-grained and fine-grained violence surveillance. Extensive
experiments were conducted on two commonly used benchmarks, and the results
show that RuleVM achieved the best performance in both coarse-grained and
fine-grained monitoring, significantly outperforming existing state-of-the-art
methods. Moreover, interpretability experiments uncovered some interesting
rules, such as the observation that as the number of people increases, the risk
level of violent behavior also rises.
comment: 12 pages, 7 figures. IEEE TSMCA (Under review)
♻ ☆ Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques
Recently, the remarkable success of ChatGPT has sparked a renewed wave of
interest in artificial intelligence (AI), and the advancements in visual
language models (VLMs) have pushed this enthusiasm to new heights. Differing
from previous AI approaches that generally formulated different tasks as
discriminative models, VLMs frame tasks as generative models and align language
with visual information, enabling the handling of more challenging problems.
The remote sensing (RS) field, a highly practical domain, has also embraced
this new trend and introduced several VLM-based RS methods that have
demonstrated promising performance and enormous potential. In this paper, we
first review the fundamental theories related to VLM, then summarize the
datasets constructed for VLMs in remote sensing and the various tasks they
addressed. Finally, we categorize the improvement methods into three main parts
according to the core components of VLMs and provide a detailed introduction
and comparison of these methods. A project associated with this review has been
created at https://github.com/taolijie11111/VLMs-in-RS-review.
♻ ☆ Grounding is All You Need? Dual Temporal Grounding for Video Dialog
In the realm of video dialog response generation, the understanding of video
content and the temporal nuances of conversation history are paramount. While a
segment of current research leans heavily on large-scale pretrained
visual-language models and often overlooks temporal dynamics, another delves
deep into spatial-temporal relationships within videos but demands intricate
object trajectory pre-extractions and sidelines dialog temporal dynamics. This
paper introduces the Dual Temporal Grounding-enhanced Video Dialog model
(DTGVD), strategically designed to merge the strengths of both dominant
approaches. It emphasizes dual temporal relationships by predicting dialog
turn-specific temporal regions, filtering video content accordingly, and
grounding responses in both video and dialog contexts. One standout feature of
DTGVD is its heightened attention to chronological interplay. By recognizing
and acting upon the dependencies between different dialog turns, it captures
more nuanced conversational dynamics. To further bolster the alignment between
video and dialog temporal dynamics, we've implemented a list-wise contrastive
learning strategy. Within this framework, accurately grounded turn-clip
pairings are designated as positive samples, while less precise pairings are
categorized as negative. This refined classification is then funneled into our
holistic end-to-end response generation mechanism. Evaluations using
AVSD@DSTC-7 and AVSD@DSTC-8 datasets underscore the superiority of our
methodology.
♻ ☆ ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models
Recent research in tabular data synthesis has focused on single tables,
whereas real-world applications often involve complex data with tens or
hundreds of interconnected tables. Previous approaches to synthesizing
multi-relational (multi-table) data fall short in two key aspects: scalability
for larger datasets and capturing long-range dependencies, such as correlations
between attributes spread across different tables. Inspired by the success of
diffusion models in tabular data modeling, we introduce
$\textbf{C}luster$ $\textbf{La}tent$ $\textbf{Va}riable$ $guided$
$\textbf{D}enoising$ $\textbf{D}iffusion$ $\textbf{P}robabilistic$
$\textbf{M}odels$ (ClavaDDPM). This novel approach leverages clustering labels
as intermediaries to model relationships between tables, specifically focusing
on foreign key constraints. ClavaDDPM leverages the robust generation
capabilities of diffusion models while incorporating efficient algorithms to
propagate the learned latent variables across tables. This enables ClavaDDPM to
capture long-range dependencies effectively.
Extensive evaluations on multi-table datasets of varying sizes show that
ClavaDDPM significantly outperforms existing methods for these long-range
dependencies while remaining competitive on utility metrics for single-table
data.
♻ ☆ IRCAN: Mitigating Knowledge Conflicts in LLM Generation via Identifying and Reweighting Context-Aware Neurons NeurIPS 2024
It is widely acknowledged that large language models (LLMs) encode a vast
reservoir of knowledge after being trained on mass data. Recent studies
disclose knowledge conflicts in LLM generation, wherein outdated or incorrect
parametric knowledge (i.e., encoded knowledge) contradicts new knowledge
provided in the context. To mitigate such knowledge conflicts, we propose a
novel framework, IRCAN (Identifying and Reweighting Context-Aware Neurons) to
capitalize on neurons that are crucial in processing contextual cues.
Specifically, IRCAN first identifies neurons that significantly contribute to
context processing, utilizing a context-aware attribution score derived from
integrated gradients. Subsequently, the identified context-aware neurons are
strengthened via reweighting. In doing so, we steer LLMs to generate
context-sensitive outputs with respect to the new knowledge provided in the
context. Extensive experiments conducted across a variety of models and tasks
demonstrate that IRCAN not only achieves remarkable improvements in handling
knowledge conflicts but also offers a scalable, plug-and-play solution that can
be integrated seamlessly with existing models. Our code is released at
https://github.com/danshi777/IRCAN.
comment: NeurIPS 2024
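
The two steps named in the abstract above (attribution via integrated gradients, then reweighting of the selected neurons) can be illustrated on a toy function. This is a minimal sketch under stated assumptions, not IRCAN's implementation: the helper names `integrated_gradients` and `reweight_top_k`, the midpoint-rule approximation, the top-k selection rule, and the scaling factor `beta` are all hypothetical choices made for illustration.

```python
from typing import Callable, List

Vector = List[float]


def integrated_gradients(
    grad_fn: Callable[[Vector], Vector],
    x: Vector,
    baseline: Vector,
    steps: int = 50,
) -> Vector:
    """Midpoint-rule approximation of integrated gradients along the
    straight path from `baseline` to `x`."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the average gradient by the input difference, per coordinate.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def reweight_top_k(weights: Vector, scores: Vector, top_k: int, beta: float) -> Vector:
    """Scale the weights of the `top_k` highest-scoring units by `beta`.
    Hypothetical rule for illustration; other units are left unchanged."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:top_k])
    return [w * beta if i in chosen else w for i, w in enumerate(weights)]


# Toy "model": f(a) = a0**2 + 3*a1, so grad f(a) = (2*a0, 3).
def toy_grad(a: Vector) -> Vector:
    return [2.0 * a[0], 3.0]


scores = integrated_gradients(toy_grad, x=[1.0, 2.0], baseline=[0.0, 0.0])
new_weights = reweight_top_k([0.5, 0.5], scores, top_k=1, beta=2.0)
```

In this toy case the attributions are approximately 1 and 6, which sum to f(x) - f(baseline) = 7, the completeness property of integrated gradients. Only the second unit, which has the larger score, is scaled by `beta`.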
♻ ☆ An interpretable generative multimodal neuroimaging-genomics framework for decoding Alzheimer's disease
Giorgio Dolci, Federica Cruciani, Md Abdur Rahaman, Anees Abrol, Jiayu Chen, Zening Fu, Ilaria Boscolo Galazzo, Gloria Menegaz, Vince D. Calhoun
Alzheimer's disease (AD) is the most prevalent form of dementia with a
progressive decline in cognitive abilities. The AD continuum encompasses a
prodromal stage known as MCI, where patients may either progress to AD (MCIc)
or remain stable (MCInc). Understanding AD mechanisms requires complementary
analyses relying on different data sources, leading to the development of
multimodal DL models. We leveraged structural and functional MRI to investigate
the disease-induced GM and functional network connectivity changes. Moreover,
considering AD's strong genetic component, we introduced SNPs as a third
channel. Missing one or more modalities is a typical concern of multimodal
methods. We hence propose a novel DL-based classification framework where a
generative module employing Cycle GAN was adopted for imputing missing data in
the latent space. Additionally, we adopted an XAI method, Integrated Gradients,
to extract features' relevance, enhancing our understanding of the learned
representations. Two tasks were addressed: AD detection and MCI conversion
prediction. Experimental results showed that our framework reached the state of the art in
the classification of CN/AD with an average test accuracy of $0.926\pm0.02$.
For the MCInc/MCIc task, we achieved an average prediction accuracy of
$0.711\pm0.01$ using the pre-trained model for CN and AD. The interpretability
analysis revealed that significant GM modulations led the classification
performance in cortical and subcortical brain areas well known for their
association with AD. Impairments in sensory-motor and visual functional network
connectivity along AD, as well as mutations in SNPs defining biological
processes linked to endocytosis, amyloid-beta, and cholesterol, were identified
as contributors to the results. Overall, our integrative DL model shows promise
for AD detection and MCI prediction, while shedding light on important
biological insights.
comment: 28 pages, 8 figures, submitted to a journal
♻ ☆ Uncovering communities of pipelines in the task-fMRI analytical space
Analytical workflows in functional magnetic resonance imaging are highly
flexible with limited best practices as to how to choose a pipeline. While it
has been shown that the use of different pipelines might lead to different
results, there is still a lack of understanding of the factors that drive these
differences and of the stability of these differences across contexts. We use
community detection algorithms to explore the pipeline space and assess the
stability of pipeline relationships across different contexts. We show that
there are subsets of pipelines that give similar results, especially those
sharing specific parameters (e.g. number of motion regressors, software
packages, etc.). Those pipeline-to-pipeline patterns are stable across groups
of participants but not across different tasks. By visualizing the differences
between communities, we show that the pipeline space is mainly driven by the
size of the activation area in the brain and the scale of statistic values in
statistic maps.
comment: Accepted at the 2024 IEEE International Conference on Image
Processing
♻ ☆ A taxonomy of explanations to support Explainability-by-Design
As automated decision-making solutions are increasingly applied to all
aspects of everyday life, capabilities to generate meaningful explanations for
a variety of stakeholders (e.g., decision-makers, recipients of decisions,
auditors, regulators) become crucial. In this paper, we present a taxonomy
of explanations that was developed as part of a holistic
'Explainability-by-Design' approach for the purposes of the project PLEAD. The
taxonomy was built with a view to produce explanations for a wide range of
requirements stemming from a variety of regulatory frameworks or policies set
at the organizational level either to translate high-level compliance
requirements or to meet business needs. The taxonomy comprises nine dimensions.
It is used as a stand-alone classifier of explanations conceived as detective
controls, in order to aid supportive automated compliance strategies. A
machine-readable format of the taxonomy is provided in the form of a light
ontology and the benefits of starting the Explainability-by-Design journey with
such a taxonomy are demonstrated through a series of examples.
♻ ☆ SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark NeurIPS 2024
Electronic health records (EHRs) are stored in various database systems with
different database models on heterogeneous storage architectures, such as
relational databases, document stores, or graph databases. These different
database models have a big impact on query complexity and performance. While
this has been a known fact in database research, its implications for the
growing number of Text-to-Query systems have surprisingly not been investigated
so far. In this paper, we present SM3-Text-to-Query, the first multi-model
medical Text-to-Query benchmark based on synthetic patient data from Synthea,
following the SNOMED-CT taxonomy -- a widely used knowledge graph ontology
covering medical terminology. SM3-Text-to-Query provides data representations
for relational databases (PostgreSQL), document stores (MongoDB), and graph
databases (Neo4j and GraphDB (RDF)), allowing the evaluation across four
popular query languages, namely SQL, MQL, Cypher, and SPARQL. We systematically
and manually develop 408 template questions, which we augment to construct a
benchmark of 10K diverse natural language question/query pairs for these four
query languages (40K pairs overall). On our dataset, we evaluate several common
in-context-learning (ICL) approaches for a set of representative closed and
open-source LLMs. Our evaluation sheds light on the trade-offs between database
models and query languages for different ICL strategies and LLMs. Last,
SM3-Text-to-Query is easily extendable to additional query languages or real,
standard-based patient databases.
comment: NeurIPS 2024 Track Datasets and Benchmarks
♻ ☆ Toward Green and Human-Like Artificial Intelligence: A Complete Survey on Contemporary Few-Shot Learning Approaches
Despite deep learning's widespread success, its data-hungry and
computationally expensive nature makes it impractical for many data-constrained
real-world applications. Few-Shot Learning (FSL) aims to address these
limitations by enabling rapid adaptation to novel learning tasks, seeing
significant growth in recent years. This survey provides a comprehensive
overview of the field's latest advancements. Initially, FSL is formally
defined, and its relationship with different learning fields is presented. A
novel taxonomy is introduced, extending previously proposed ones, and
real-world applications in classic and novel fields are described. Finally,
recent trends shaping the field, outstanding challenges, and promising future
research directions are discussed.
comment: 35 pages, 9 figures. Submitted to ACM Computing Surveys
♻ ☆ Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology
The cognitive mechanism by which Large Language Models (LLMs) solve
mathematical problems remains a widely debated and unresolved issue. Currently,
there is little interpretable experimental evidence that connects LLMs'
problem-solving with human cognitive psychology. To determine if LLMs possess
human-like mathematical reasoning, we modified the problems used in the human
Cognitive Reflection Test (CRT). Our results show that, even with the use of
Chains of Thought (CoT) prompts, mainstream LLMs, including the latest o1 model
(noted for its reasoning capabilities), have a high error rate when solving
these modified CRT problems. Specifically, the average accuracy rate dropped by
up to 50% compared to the original questions. Further analysis of LLMs'
incorrect answers suggests that they primarily rely on pattern matching from
their training data, which aligns more with human intuition (System 1 thinking)
rather than with human-like reasoning (System 2 thinking). This finding
challenges the belief that LLMs have genuine mathematical reasoning abilities
comparable to humans. As a result, this work may adjust overly optimistic views
on LLMs' progress towards artificial general intelligence.
♻ ☆ An improved tabular data generator with VAE-GMM integration
The rising use of machine learning in various fields requires robust methods
to create synthetic tabular data. Data should preserve key characteristics
while addressing data scarcity challenges. Current approaches based on
Generative Adversarial Networks, such as the state-of-the-art CTGAN model,
struggle with the complex structures inherent in tabular data. These data often
contain both continuous and discrete features with non-Gaussian distributions.
Therefore, we propose a novel Variational Autoencoder (VAE)-based model that
addresses these limitations. Inspired by the TVAE model, our approach
incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE
architecture. This avoids the limitations imposed by assuming a strictly
Gaussian latent space, allowing for a more accurate representation of the
underlying data distribution during data generation. Furthermore, our model
offers enhanced flexibility by allowing the use of various differentiable
distributions for individual features, making it possible to handle both
continuous and discrete data types. We thoroughly validate our model on three
real-world datasets with mixed data types, including two medically relevant
ones, based on their resemblance and utility. This evaluation demonstrates
significant outperformance against CTGAN and TVAE, establishing its potential
as a valuable tool for generating synthetic tabular data in various domains,
particularly in healthcare.
comment: 7 pages, 3 figures
♻ ☆ More Expressive Attention with Negative Weights
We propose a novel attention mechanism, named Cog Attention, that enables
attention weights to be negative for enhanced expressiveness, which stems from
two key factors: (1) Cog Attention can shift the token deletion and copying
function from a static OV matrix to dynamic QK inner products, with the OV
matrix now focusing more on refinement or modification. The attention head can
simultaneously delete, copy, or retain tokens by assigning them negative,
positive, or minimal attention weights, respectively. As a result, a single
attention head becomes more flexible and expressive. (2) Cog Attention improves
the model's robustness against representational collapse, which can occur when
earlier tokens are over-squashed into later positions, leading to homogeneous
representations. Negative weights reduce effective information paths from
earlier to later tokens, helping to mitigate this issue. We develop
Transformer-like models which use Cog Attention as attention modules, including
decoder-only models for language modeling and U-ViT diffusion models for image
generation. Experiments show that models using Cog Attention exhibit superior
performance compared to those employing traditional softmax attention modules.
Our approach suggests a promising research direction for rethinking and
breaking the entrenched constraints of traditional softmax attention, such as
the requirement for non-negative weights.
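
To make the idea of negative attention weights concrete, here is a small sketch of one way signed weights could be produced. It is an assumption for illustration only and may differ from the Cog Attention formulation: a softmax is applied to the absolute query-key scores and the original sign is then restored. The function names `signed_weights` and `attend` are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def signed_weights(scores: List[float]) -> List[float]:
    """Softmax over |score|, with the sign of each score restored.
    The magnitudes sum to 1, but individual weights may be negative."""
    mags = softmax([abs(s) for s in scores])
    return [math.copysign(w, s) for w, s in zip(mags, scores)]


def attend(weights: List[float], values: List[List[float]]) -> List[float]:
    """Weighted sum of value vectors, one weight per value vector."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Two tokens carry the same value vector; opposite-signed scores cancel them.
weights = signed_weights([2.0, -2.0, 0.0])
output = attend(weights, [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Here the first two tokens receive weights of equal magnitude and opposite sign, so their shared component cancels in the output while the third token is retained with a small positive weight. With a standard softmax all three weights would be non-negative and no such cancellation could occur.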
♻ ☆ Dual-Segment Clustering Strategy for Hierarchical Federated Learning in Heterogeneous Wireless Environments
Pengcheng Sun, Erwu Liu, Wei Ni, Kanglei Yu, Xinyu Qu, Rui Wang, Yanlong Bi, Chuanchun Zhang, Abbas Jamalipour
Non-independent and identically distributed (Non-IID) data adversely affects
federated learning (FL) while heterogeneity in communication quality can
undermine the reliability of model parameter transmission, potentially
degrading wireless FL convergence. This paper proposes a novel dual-segment
clustering (DSC) strategy that jointly addresses communication and data
heterogeneity in FL. This is achieved by defining a new signal-to-noise ratio
(SNR) matrix and information quantity matrix to capture the communication and
data heterogeneity, respectively. The celebrated affinity propagation algorithm
is leveraged to iteratively refine the clustering of clients based on the newly
defined matrices, effectively enhancing model aggregation in heterogeneous
environments. The convergence analysis and experimental results show that the
DSC strategy can improve the convergence rate of wireless FL and demonstrate
superior accuracy in heterogeneous environments compared to classical
clustering methods.
♻ ☆ STARFlow: Spatial Temporal Feature Re-embedding with Attentive Learning for Real-world Scene Flow 3DV 2025
Scene flow prediction is a crucial underlying task in understanding dynamic
scenes as it offers fundamental motion information. However, contemporary scene
flow methods encounter three major challenges. Firstly, flow estimation solely
based on local receptive fields lacks long-dependency matching of point pairs.
To address this issue, we propose global attentive flow embedding to match
all-to-all point pairs in both feature space and Euclidean space, providing
global initialization before local refinement. Secondly, there are deformations
existing in non-rigid objects after warping, which leads to variations in the
spatiotemporal relation between the consecutive frames. For a more precise
estimation of residual flow, a spatial temporal feature re-embedding module is
devised to acquire the sequence features after deformation. Furthermore,
previous methods generalize poorly due to the significant domain gap
between the synthesized and LiDAR-scanned datasets. We leverage novel domain
adaptive losses to effectively bridge the gap of motion inference from
synthetic to real-world. Experiments demonstrate that our approach achieves
state-of-the-art performance across various datasets, with particularly
outstanding results on real-world LiDAR-scanned datasets. Our code is available
at https://github.com/O-VIGIA/StarFlow.
comment: This paper was renamed to: "SSRFlow: Semantic-aware Fusion with
Spatial Temporal Re-embedding for Real-world Scene Flow" [arXiv:2408.07825]
and was accepted in 3DV 2025
♻ ☆ The Roles of Generative Artificial Intelligence in Internet of Electric Vehicles
Hanwen Zhang, Dusit Niyato, Wei Zhang, Changyuan Zhao, Hongyang Du, Abbas Jamalipour, Sumei Sun, Yiyang Pei
With the advancements of generative artificial intelligence (GenAI) models,
their capabilities are expanding significantly beyond content generation and
the models are increasingly being used across diverse applications.
Particularly, GenAI shows great potential in addressing challenges in the
electric vehicle (EV) ecosystem ranging from charging management to
cyber-attack prevention. In this paper, we specifically consider Internet of
electric vehicles (IoEV) and we categorize GenAI for IoEV into four different
layers, namely the EV's battery layer, individual EV layer, smart grid layer, and
security layer. We introduce various GenAI techniques used in each layer of
IoEV applications. Subsequently, public datasets available for training the
GenAI models are summarized. Finally, we provide recommendations for future
directions. This survey not only categorizes the applications of GenAI in IoEV
across different layers but also serves as a valuable resource for researchers
and practitioners by highlighting the design and implementation challenges
within each layer. Furthermore, it provides a roadmap for future research
directions, enabling the development of more robust and efficient IoEV systems
through the integration of advanced GenAI techniques.
comment: 25 Pages
♻ ☆ Towards Objective and Unbiased Decision Assessments with LLM-Enhanced Hierarchical Attention Networks
How objective and unbiased are we while making decisions? This work
investigates cognitive bias identification in the high-stakes decision-making
process of human experts, questioning its effectiveness in real-world settings,
such as candidate assessments for university admission. We begin with a
statistical analysis assessing correlations among different decision points
in the current process, which discovers discrepancies that imply
cognitive bias and inconsistency in decisions. This motivates our exploration
of a bias-aware AI-augmented workflow that surpasses human judgment. We propose
BGM-HAN, an enhanced Hierarchical Attention Network with Byte-Pair Encoding,
Gated Residual Connections and Multi-Head Attention. Using it as a backbone
model, we further propose a Shortlist-Analyse-Recommend (SAR) agentic workflow,
which simulates real-world decision-making. In our experiments, both the
proposed model and the agentic workflow significantly improve on both human
judgment and alternative models, validated with real-world data.
comment: Source code is available at: https://github.com/junhua/bgm-han
♻ ☆ LProtector: An LLM-driven Vulnerability Detection System
This paper presents LProtector, an automated vulnerability detection system
for C/C++ codebases driven by the large language model (LLM) GPT-4o and
Retrieval-Augmented Generation (RAG). As software complexity grows, traditional
methods face challenges in detecting vulnerabilities effectively. LProtector
leverages GPT-4o's powerful code comprehension and generation capabilities to
perform binary classification and identify vulnerabilities within target
codebases. We conducted experiments on the Big-Vul dataset, showing that
LProtector outperforms two state-of-the-art baselines in terms of F1 score,
demonstrating the potential of integrating LLMs with vulnerability detection.
comment: 5 pages, 4 figures. This is a preprint version of the article. The
final version will be published in the proceedings of the IEEE conference
♻ ☆ Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning
Key-Value (KV) caching is a common technique to enhance the computational
efficiency of Large Language Models (LLMs), but its memory overhead grows
rapidly with input length. Prior work has shown that not all tokens are equally
important for text generation, proposing layer-level KV cache compression to
selectively retain key information. Recognizing the distinct roles of attention
heads in generation, we propose HeadKV, a head-level KV cache compression
method, and HeadKV-R2, which leverages a novel contextual reasoning ability
estimation for compression. Our approach operates at the level of individual
heads, estimating their importance for contextual QA tasks that require both
retrieval and reasoning capabilities. Extensive experiments across diverse
benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct,
Mistral-7B-Instruct), and long-context ability tests demonstrate that our
head-level KV cache compression significantly outperforms strong baselines,
particularly in low-resource settings (KV size = 64 & 128). Notably, our method
retains just 1.5% of the KV cache while achieving 97% of the performance of the
full KV cache on the contextual question answering benchmark. Code is
available at https://github.com/FYYFU/HeadKV
comment: 18 pages
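
The head-level idea in the abstract above (different heads get different shares of the KV cache) can be sketched as a budget allocation followed by per-head token selection. This is an illustrative sketch, not the HeadKV algorithm: the proportional allocation rule, the per-head `floor`, the integer importance scores, and the function names `allocate_budgets` and `keep_top_tokens` are all assumptions.

```python
from typing import List


def allocate_budgets(importance: List[float], total_budget: int, floor: int = 1) -> List[int]:
    """Split `total_budget` cache slots across heads.
    Every head gets `floor` slots; the rest is shared in proportion to
    `importance` (hypothetical rule), using largest-remainder rounding."""
    n = len(importance)
    spare = total_budget - floor * n
    total_imp = sum(importance)
    if spare < 0 or total_imp <= 0:
        raise ValueError("budget too small or importance scores not positive")
    raw = [floor + spare * imp / total_imp for imp in importance]
    budgets = [int(r) for r in raw]  # round down first
    leftover = total_budget - sum(budgets)
    # Hand the remaining slots to the heads with the largest fractional parts.
    order = sorted(range(n), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def keep_top_tokens(token_scores: List[float], budget: int) -> List[int]:
    """Return the positions of the `budget` highest-scoring tokens for one
    head, in their original order."""
    ranked = sorted(range(len(token_scores)), key=lambda t: token_scores[t], reverse=True)
    return sorted(ranked[:budget])


# Three heads with made-up importance scores sharing 13 cache slots.
budgets = allocate_budgets([5, 3, 2], total_budget=13)
kept = keep_top_tokens([0.1, 0.9, 0.3, 0.7], budget=2)
```

The budgets always sum to the total, so the overall cache size is fixed while more important heads retain more tokens. In a layer-level scheme, by contrast, every head in a layer would receive the same budget.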